I am using sklearn for a multi-class classification task. I need to split all the data into train_set and test_set, taking the same number of randomly chosen samples from each class.
Currently I am using this function:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0)
but it gives an unbalanced dataset. Any suggestions?
Although Christian's suggestion is correct, technically train_test_split should give you stratified results by using the stratify param.
So you could do:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0, stratify=Target)
The catch is that the stratify parameter is only available from version 0.17 of sklearn onwards.
From the documentation about the parameter stratify:
stratify : array-like or None (default is None)
If not None, data is split in a stratified fashion, using this as the labels array.
New in version 0.17: stratify splitting
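As a minimal sketch of what stratify does, here is a toy example. The array contents and class sizes are made up for illustration, and the import assumes a recent scikit-learn, where train_test_split lives in sklearn.model_selection (the cross_validation module was removed in 0.20). Note that stratification keeps the class proportions of Target in both parts; it does not equalise the class counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data (illustrative only): 80 samples of class 0, 20 of class 1
Data = np.arange(200).reshape(100, 2)
Target = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    Data, Target, test_size=0.3, random_state=0, stratify=Target)

# Both parts keep the original 80/20 class ratio
print(np.bincount(y_train))  # [56 14]
print(np.bincount(y_test))   # [24  6]
```

So if the original data is 80/20, the stratified train and test sets are 80/20 as well, which is why stratifying alone does not give a balanced split.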
You can use StratifiedShuffleSplit to create datasets featuring the same percentage of classes as the original one:
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
X = np.array([[1, 3], [3, 7], [2, 4], [4, 8]])
y = np.array([0, 1, 0, 1])
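# With sklearn.model_selection the data is passed to split(), not to the constructor
stratSplit = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
for train_idx, test_idx in stratSplit.split(X, y):
    X_train = X[train_idx]
    y_train = y[train_idx]
print(X_train)
# e.g. [[3 7]
#       [2 4]]   (one sample per class; the exact rows depend on the seed)
print(y_train)
# e.g. [1 0]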
If the classes are not balanced but you want the split to be balanced, then stratifying isn't going to help. There doesn't seem to be a method for doing balanced sampling in sklearn but it's kind of easy using basic numpy, for example a function like this might help you:
def split_balanced(data, target, test_size=0.2):
    classes = np.unique(target)
    # can give test_size as fraction of input data size or as number of samples
    if test_size<1:
        n_test = np.round(len(target)*test_size)
    else:
        n_test = test_size
    n_train = max(0,len(target)-n_test)
    n_train_per_class = max(1,int(np.floor(n_train/len(classes))))
    n_test_per_class = max(1,int(np.floor(n_test/len(classes))))
    ixs = []
    for cl in classes:
        if (n_train_per_class+n_test_per_class) > np.sum(target==cl):
            # if data has too few samples for this class, do upsampling
            # split the data to training and testing before sampling so data points won't be
            # shared among training and test data
            splitix = int(np.ceil(n_train_per_class/(n_train_per_class+n_test_per_class)*np.sum(target==cl)))
            ixs.append(np.r_[np.random.choice(np.nonzero(target==cl)[0][:splitix], n_train_per_class),
                             np.random.choice(np.nonzero(target==cl)[0][splitix:], n_test_per_class)])
        else:
            ixs.append(np.random.choice(np.nonzero(target==cl)[0], n_train_per_class+n_test_per_class,
                                        replace=False))
    # take same num of samples from all classes
    ix_train = np.concatenate([x[:n_train_per_class] for x in ixs])
    ix_test = np.concatenate([x[n_train_per_class:(n_train_per_class+n_test_per_class)] for x in ixs])
    X_train = data[ix_train,:]
    X_test = data[ix_test,:]
    y_train = target[ix_train]
    y_test = target[ix_test]
    return X_train, X_test, y_train, y_test
Note that if you use this and sample more points per class than in the input data, then those will be upsampled (sample with replacement). As a result, some data points will appear multiple times and this may have an effect on the accuracy measures etc. And if some class has only one data point, there will be an error. You can easily check the numbers of points per class for example with np.unique(target, return_counts=True)
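The per-class check mentioned above can be sketched like this. The target array is a made-up example, and the smallest class count shows how many samples per class can be drawn without replacement.

```python
import numpy as np

# Illustrative labels: class 0 has 4 points, class 1 has 2, class 2 has only 1
target = np.array([0, 0, 0, 0, 1, 1, 2])

classes, counts = np.unique(target, return_counts=True)
per_class = dict(zip(classes.tolist(), counts.tolist()))
print(per_class)  # {0: 4, 1: 2, 2: 1}

# Largest per-class sample size that needs no upsampling
smallest = int(counts.min())
print(smallest)  # 1
```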
Another approach is to over- or under-sample from your stratified test/train split. The imbalanced-learn library is quite handy for this, and especially useful if you are doing online learning and want to guarantee balanced training data within your pipelines.
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline as ImbalancePipeline
from sklearn.svm import SVC
model = ImbalancePipeline(steps=[
    ('data_balancer', RandomOverSampler()),
    ('classifier', SVC()),
])
This is the implementation I use to get train/test data indexes:
import random
import numpy as np
def get_safe_balanced_split(target, trainSize=0.8, getTestIndexes=True, shuffle=False, seed=None):
    classes, counts = np.unique(target, return_counts=True)
    nPerClass = float(len(target))*float(trainSize)/float(len(classes))
    if nPerClass > np.min(counts):
        print("Insufficient data to produce a balanced training data split.")
        print("Classes found %s"%classes)
        print("Classes count %s"%counts)
        ts = float(trainSize*np.min(counts)*len(classes)) / float(len(target))
        print("trainSize is reset from %s to %s"%(trainSize, ts))
        trainSize = ts
        nPerClass = float(len(target))*float(trainSize)/float(len(classes))
    # get number of samples per class
    nPerClass = int(nPerClass)
    print("Data splitting on %i classes and returning %i per class"%(len(classes),nPerClass ))
    # get indexes
    trainIndexes = []
    for c in classes:
        if seed is not None:
            np.random.seed(seed)
        cIdxs = np.where(target==c)[0]
        cIdxs = np.random.choice(cIdxs, nPerClass, replace=False)
        trainIndexes.extend(cIdxs)
    # get test indexes
    testIndexes = None
    if getTestIndexes:
        testIndexes = list(set(range(len(target))) - set(trainIndexes))
    # shuffle (random.shuffle works in place and returns None, so do not assign its result)
    if shuffle:
        random.shuffle(trainIndexes)
        if testIndexes is not None:
            random.shuffle(testIndexes)
    # return indexes
    return trainIndexes, testIndexes
This is the function I am using. You can adapt it and optimize it.
# Returns a test set that contains an equal number of samples from each class
# y should contain only two classes 0 and 1
def TrainSplitEqualBinary(X, y, samples_n): #samples_n per class
    indicesClass1 = []
    indicesClass2 = []
    for i in range(0, len(y)):
        if y[i] == 0 and len(indicesClass1) < samples_n:
            indicesClass1.append(i)
        elif y[i] == 1 and len(indicesClass2) < samples_n:
            indicesClass2.append(i)
        if len(indicesClass1) == samples_n and len(indicesClass2) == samples_n:
            break
    X_test_class1 = X[indicesClass1]
    X_test_class2 = X[indicesClass2]
    X_test = np.concatenate((X_test_class1,X_test_class2), axis=0)
    #remove x_test from X
    X_train = np.delete(X, indicesClass1 + indicesClass2, axis=0)
    Y_test_class1 = y[indicesClass1]
    Y_test_class2 = y[indicesClass2]
    y_test = np.concatenate((Y_test_class1,Y_test_class2), axis=0)
    #remove y_test from y
    y_train = np.delete(y, indicesClass1 + indicesClass2, axis=0)
    if (X_test.shape[0] != 2 * samples_n or y_test.shape[0] != 2 * samples_n):
        raise Exception("Problem with split 1!")
    if (X_train.shape[0] + X_test.shape[0] != X.shape[0] or y_train.shape[0] + y_test.shape[0] != y.shape[0]):
        raise Exception("Problem with split 2!")
    return X_train, X_test, y_train, y_test
I am trying to do feature selection using Ant colony optimization (ACO) for a rainfall dataset. The implementation of the code is below
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
X = x
y = df_cap['PRECTOTCORR_SUM']
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define ACO feature selection function
def aco_feature_selection(X_train, X_test, y_train, y_test, num_ants=10, max_iter=50, alpha=1.0, beta=2.0, evaporation=0.5, q0=0.9):
    num_features = X_train.shape[1]
    pheromone = np.ones(num_features)
    best_solution = None
    best_accuracy = 0.0
    # Run ACO algorithm
    for i in range(max_iter):
        ant_solutions = []
        ant_accuracies = []
        # Generate ant solutions
        for ant in range(num_ants):
            features = np.random.choice([0, 1], size=num_features, p=[1-pheromone,pheromone])
            X_train_selected = X_train[:, features == 1]
            X_test_selected = X_test[:, features == 1]
            knn = KNeighborsClassifier()
            knn.fit(X_train_selected, y_train)
            y_pred = knn.predict(X_test_selected)
            accuracy = accuracy_score(y_test, y_pred)
            ant_solutions.append(features)
            ant_accuracies.append(accuracy)
            # Update best solution
            if accuracy > best_accuracy:
                best_solution = features
                best_accuracy = accuracy
        # Update pheromone levels
        pheromone *= evaporation
        for ant in range(num_ants):
            features = ant_solutions[ant]
            accuracy = ant_accuracies[ant]
            if accuracy >= np.mean(ant_accuracies):
                pheromone[features == 1] += alpha
            else:
                pheromone[features == 1] += beta
        # Apply elitism
        if best_solution is not None:
            pheromone[best_solution == 1] += q0
    return best_solution
# Run ACO feature selection
selected_features = aco_feature_selection(X_train, X_test, y_train, y_test)
# Print selected features
print("Selected features:", np.where(selected_features == 1)[0])
but I get this error
ValueError
Input In [175], in aco_feature_selection(X_train, X_test, y_train, y_test, num_ants, max_iter, alpha, beta, evaporation, q0)
26 # Generate ant solutions
27 for ant in range(num_ants):
---> 28 features = np.random.choice([0, 1], size=num_features, p=[1-pheromone,pheromone])
29 X_train_selected = X_train[:, features == 1]
30 X_test_selected = X_test[:, features == 1]
File mtrand.pyx:930, in numpy.random.mtrand.RandomState.choice()
ValueError: 'p' must be 1-dimensional
I suspect the issue comes from the list inside a list, which makes p 2-dimensional instead of 1-dimensional. Using something like flatten() throws this error instead:
ValueError: 'a' and 'p' must have same size
How do I fix this?
The issue is that p must be a single 1-D array of probabilities, with one entry per element of a (here two entries, one for 0 and one for 1), but you are passing a list of two arrays: 1 - pheromone and pheromone. Without getting into the details of the algorithm, I can suggest that you need to choose a specific pheromone value for each feature.
If you want to generate a series of 0s and 1s with given probabilities, you need to iterate over pheromone.
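A minimal sketch of that idea, under one loud assumption: the pheromone values are clipped into [0, 1] so they can be read as selection probabilities (the original algorithm lets them grow beyond 1, so some normalisation of this kind is needed; clipping is only one option). The helper name sample_feature_mask is hypothetical.

```python
import numpy as np

def sample_feature_mask(pheromone, rng):
    """Draw one 0/1 value per feature, using each pheromone value as its probability."""
    # Assumption: clip into [0, 1] so each value is a valid probability
    p_select = np.clip(pheromone, 0.0, 1.0)
    # Vectorised Bernoulli draw: 1 where a uniform sample falls below p_select
    return (rng.random(p_select.shape[0]) < p_select).astype(int)

rng = np.random.default_rng(0)
pheromone = np.array([0.0, 1.0, 0.5, 0.9])
mask = sample_feature_mask(pheromone, rng)
print(mask)

# Equivalent explicit loop: one call to choice per feature, each with a 1-D p of length 2
loop_mask = np.array([rng.choice([0, 1], p=[1 - p, p])
                      for p in np.clip(pheromone, 0.0, 1.0)])
print(loop_mask)
```

Either form gives p the shape np.random.choice expects; the vectorised version avoids the Python loop.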
I have loaded the CIFAR10 dataset but I want to divide it into multiple splits.
Here is how I downloaded the dataset
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
Then I used ShuffleSplit to create a generator to split the data like this:
from sklearn.model_selection import ShuffleSplit
rs = ShuffleSplit(n_splits=3, test_size=0.1, random_state=0)
splits = rs.split(x_train)
I know I can iterate over the generated splits using:
for train_index, test_index in splits:
    # train_index is a np array which holds the indices
    print("TRAIN:", train_index, "TEST:", test_index)
Assume that at the end I want to have:
x_train1, y_train1, x_train2, y_train2, x_train3, y_train3
How can I divide the data based on the generated indices, so that each training split contains both the training and the testing indices?
I tried combining the indices into a list and concatenating the arrays, but it did not work.
I was able to solve the problem with a different approach. The code is below:
partitions_train_x = []
partitions_train_y = []
partitions_test_x = []
partitions_test_y = []
# num_partitions is defined elsewhere; np.split requires it to divide the array length evenly
x = np.arange(len(y_train))
np.random.shuffle(x)
indices = np.split(x, num_partitions)
for data_indices in indices:
    x = x_train[data_indices]
    y = y_train[data_indices]
    partitions_train_x.append(x)
    partitions_train_y.append(y)
x = np.arange(len(y_test))
indices = np.split(x, num_partitions)
for data_indices in indices:
    x = x_test[data_indices]
    y = y_test[data_indices]
    partitions_test_x.append(x)
    partitions_test_y.append(y)
I know this may not be the best way to do it, but it works.
A better way to do it :)
# assuming shuffle is sklearn.utils.shuffle, which shuffles both arrays in the same order
from sklearn.utils import shuffle
num_shreds = 10
shred_size = len(X_train)//num_shreds
X_train, y_train = shuffle(X_train, y_train)
shred_X = [X_train[i:i + shred_size] for i in range(0, shred_size* num_shreds, shred_size)]
shred_y = [y_train[i:i + shred_size] for i in range(0, shred_size* num_shreds, shred_size)]
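One caveat with fixed-size slicing is that any leftover samples are dropped when the data length is not a multiple of the number of shreds. As a hedged alternative sketch (toy arrays, numpy only), np.array_split on a shuffled index array keeps every sample and lets the parts differ in size by at most one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 23 samples, so 5 parts cannot all be the same size
X = np.arange(23 * 2).reshape(23, 2)   # row i is [2*i, 2*i + 1]
y = np.arange(23)                      # label i belongs to row i

perm = rng.permutation(len(X))         # shuffled sample indices
parts = np.array_split(perm, 5)        # 5 index arrays, uneven sizes allowed

shred_X = [X[idx] for idx in parts]
shred_y = [y[idx] for idx in parts]

sizes = [len(s) for s in shred_y]
print(sizes)  # [5, 5, 5, 4, 4]
```

Because the same index array selects from both X and y, each row keeps its label in every part.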
I have a list of lists like:
list = [[[bad, good],"Antonyms"], [[good, nice],"Synonyms"]]
I need to split this data into train, development and test sets: 60%, 20%, 20%.
I have no idea how to do it. The similar questions don't give me an answer for my case. Maybe somebody has an idea?
Thank you
I am assuming that Antonyms and Synonyms are categories (labels) in your data. Using train_test_split from sklearn, we can do the data splitting.
Note: I have changed bad, good, etc. into strings. I hope that is the case with your dataset as well.
import numpy as np
from sklearn.model_selection import train_test_split
my_list = [[['bad', 'good'],"Antonyms"], [['good', 'nice'],"Synonyms"],
[['good', 'nice'],"Synonyms"],[['good', 'nice'],"Synonyms"],
[['good', 'nice'],"Synonyms"]]
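# dtype=object keeps each [pair, label] row as two cells; recent NumPy versions reject ragged input without it
data=np.array(my_list, dtype=object)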
print(data.shape)
#(5, 2)
X,y=data[:,0],data[:,1]
#split the data to get 60% train and 40% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
#split the test again to get 20% dev and 20% test
X_dev, X_test, y_dev, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42)
print(y_train.shape,y_dev.shape,y_test.shape)
#(3,) (1,) (1,)
Train, development and test will be the three final lists generated.
import random
l = [[['bad0', 'good0'], 'Antonyms0'], [['good0', 'nice0'], 'Synonyms0'],
[['bad1', 'good1'], 'Antonyms1'], [['good1', 'nice1'], 'Synonyms1'],
[['bad2', 'good2'], 'Antonyms2'], [['good2', 'nice2'], 'Synonyms2'],
[['bad3', 'good3'], 'Antonyms3'], [['good3', 'nice3'], 'Synonyms3'],
]
#Initializing the three lists.
train = []
development = []
test = []
for i in l:
    r = random.uniform(0, 1) # Draw a new random number between 0 & 1 for every item.
    if r <= 0.6:
        train = train + i  # "+ i" flattens the item; use train.append(i) to keep each [pair, label] together
    elif r <= 0.8:
        development = development + i
    else:
        test = test + i
train
[['good1', 'nice1'],
'Synonyms1',
['bad3', 'good3'],
'Antonyms3',
['good3', 'nice3'],
'Synonyms3']
development
[['bad0', 'good0'],
'Antonyms0',
['good0', 'nice0'],
'Synonyms0',
['bad1', 'good1'],
'Antonyms1',
['bad2', 'good2'],
'Antonyms2',
['good2', 'nice2'],
'Synonyms2']
test
[]
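As the sample output shows, drawing a random number per item only gives roughly 60/20/20, and on small lists a part can end up empty. A minimal sketch of an exact split, assuming it is acceptable to shuffle a copy of the list and slice it by position (the function name split_60_20_20 and the generated items are illustrative only):

```python
import random

def split_60_20_20(items, seed=0):
    """Shuffle a copy of items and cut it into 60% / 20% / 20% by position."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 6 // 10          # integer arithmetic avoids float rounding
    n_dev = n * 2 // 10
    train = shuffled[:n_train]
    development = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]   # whatever is left goes to test
    return train, development, test

# Illustrative data in the same [pair, label] shape as above
items = [[['bad%d' % k, 'good%d' % k], 'Antonyms%d' % k] for k in range(10)]
train, development, test = split_60_20_20(items)
print(len(train), len(development), len(test))  # 6 2 2
```

Each item stays intact as a [pair, label] entry, and every item lands in exactly one of the three lists.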
I have 2 numpy arrays X and Y, with shape X: [4750, 224, 224, 3] and Y: [4750,1].
X is the training dataset and Y is the correct output label for each entry.
I want to split the data into train and test sets in order to validate my machine learning model. The split should be random, but X and Y must stay aligned, i.e. every row of X must still have its corresponding label after the split.
How can I achieve this?
This is how I would do it:
import numpy as np
def split(x, y, train_ratio=0.7):
    x_size = x.shape[0]
    train_size = int(x_size * train_ratio)
    test_size = x_size - train_size
    train_indices = np.random.choice(x_size, size=train_size, replace=False)
    mask = np.zeros(x_size, dtype=bool)
    mask[train_indices] = True
    x_train, y_train = x[mask], y[mask]
    x_test, y_test = x[~mask], y[~mask]
    return (x_train, y_train), (x_test, y_test)
I simply choose the required number of indices I need (randomly) for my train set, remaining will be for the test set.
Then use a mask to select the train and test samples.
You can also use scikit-learn's train_test_split to split your data in just two lines of code:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33)
sklearn.model_selection.train_test_split is a good choice!
But to craft one of your own:
import numpy as np
def my_train_test_split(X, Y, train_ratio=0.8):
    """return X_train, Y_train, X_test, Y_test"""
    n = X.shape[0]
    split = int(n * train_ratio)
    index = np.arange(n)
    np.random.shuffle(index)
    return X[index[:split]], Y[index[:split]], X[index[split:]], Y[index[split:]]
I have a training data set in matrix form of dimensions 5000 x 3027 (CIFAR-10 data set). Using array_split in numpy, I partitioned it into 5 different parts, and I want to select just one of the parts as the cross validation fold. However my problem comes when I use something like
XTrain[[Indexes]] where indexes is an array like [0,1,2,3], because doing this gives me a 3D tensor of dimensions 4 x 1000 x 3027, and not a matrix. How do I collapse the "4 x 1000" into 4000 rows, to get a matrix of 4000 x 3027?
for fold in range(len(X_train_folds)):
    indexes = np.delete(np.arange(len(X_train_folds)), fold)
    XTrain = X_train_folds[indexes]
    X_cv = X_train_folds[fold]
    yTrain = y_train_folds[indexes]
    y_cv = y_train_folds[fold]
    classifier.train(XTrain, yTrain)
    dists = classifier.compute_distances_no_loops(X_cv)
    y_test_pred = classifier.predict_labels(dists, k)
    num_correct = np.sum(y_test_pred == y_test)
    accuracy = float(num_correct/num_test)
    k_to_accuracy[k] = accuracy
Perhaps you can try this instead (I am new to numpy, so if I am doing something inefficient or wrong, I would be happy to be corrected):
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
k_to_accuracies = {}
for k in k_choices:
    k_to_accuracies[k] = []
    for i in range(num_folds):
        training_data, test_data = np.concatenate(X_train_folds[:i] + X_train_folds[i+1:]), X_train_folds[i]
        training_labels, test_labels = np.concatenate(y_train_folds[:i] + y_train_folds[i+1:]), y_train_folds[i]
        classifier.train(training_data, training_labels)
        predicted_labels = classifier.predict(test_data, k)
        k_to_accuracies[k].append(np.sum(predicted_labels == test_labels)/len(test_labels))
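To make the shape question concrete, here is a small sketch on toy data (10 rows and 3 columns instead of 5000 x 3027). It shows that stacking the remaining folds gives a 3-D array, while np.concatenate joins them along the first axis into a 2-D matrix. A reshape gives the same result when all folds have the same size.

```python
import numpy as np

X = np.arange(30).reshape(10, 3)
folds = np.array_split(X, 5)                   # list of 5 arrays, each of shape (2, 3)

fold = 1                                       # fold held out for validation
train_folds = folds[:fold] + folds[fold + 1:]  # list concatenation: the other 4 folds

stacked = np.array(train_folds)                # shape (4, 2, 3): the unwanted 3-D tensor
X_tr = np.concatenate(train_folds)             # shape (8, 3): folds joined row-wise
X_tr_alt = stacked.reshape(-1, stacked.shape[-1])  # same matrix via reshape

print(stacked.shape, X_tr.shape)  # (4, 2, 3) (8, 3)
```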
I would suggest using scikit-learn package. It already comes with plenty of common machine learning tools, such as K-fold cross-validation generator:
>>> from sklearn.model_selection import KFold  # sklearn.cross_validation in releases before 0.18
>>> X = # your data [samples x features]
>>> y = # gt labels
>>> kf = KFold(n_splits=5)
And then, iterate through kf.split(X):
>>> for train_index, test_index in kf.split(X):
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
...     # do something
The above loop will be executed n_splits times, each time with different training and testing indexes.
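A runnable sketch of the K-fold loop on toy data, assuming scikit-learn 0.18 or later, where KFold lives in sklearn.model_selection and takes n_splits. The arrays are placeholders for real features and labels.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

kf = KFold(n_splits=5)
fold_sizes = []
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # train and evaluate the model here; this sketch only records the fold sizes
    fold_sizes.append((len(train_index), len(test_index)))

print(fold_sizes)  # [(8, 2), (8, 2), (8, 2), (8, 2), (8, 2)]
```

The indexed results are ordinary 2-D and 1-D arrays, so no extra reshaping is needed before training.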