I am trying to do feature selection using ant colony optimization (ACO) on a rainfall dataset. My implementation is below:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
X = x
y = df_cap['PRECTOTCORR_SUM']
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define ACO feature selection function
def aco_feature_selection(X_train, X_test, y_train, y_test, num_ants=10, max_iter=50, alpha=1.0, beta=2.0, evaporation=0.5, q0=0.9):
    num_features = X_train.shape[1]
    pheromone = np.ones(num_features)
    best_solution = None
    best_accuracy = 0.0
    # Run ACO algorithm
    for i in range(max_iter):
        ant_solutions = []
        ant_accuracies = []
        # Generate ant solutions
        for ant in range(num_ants):
            features = np.random.choice([0, 1], size=num_features, p=[1-pheromone,pheromone])
            X_train_selected = X_train[:, features == 1]
            X_test_selected = X_test[:, features == 1]
            knn = KNeighborsClassifier()
            knn.fit(X_train_selected, y_train)
            y_pred = knn.predict(X_test_selected)
            accuracy = accuracy_score(y_test, y_pred)
            ant_solutions.append(features)
            ant_accuracies.append(accuracy)
            # Update best solution
            if accuracy > best_accuracy:
                best_solution = features
                best_accuracy = accuracy
        # Update pheromone levels
        pheromone *= evaporation
        for ant in range(num_ants):
            features = ant_solutions[ant]
            accuracy = ant_accuracies[ant]
            if accuracy >= np.mean(ant_accuracies):
                pheromone[features == 1] += alpha
            else:
                pheromone[features == 1] += beta
        # Apply elitism
        if best_solution is not None:
            pheromone[best_solution == 1] += q0
    return best_solution
# Run ACO feature selection
selected_features = aco_feature_selection(X_train, X_test, y_train, y_test)
# Print selected features
print("Selected features:", np.where(selected_features == 1)[0])
but I get this error
ValueError
Input In [175], in aco_feature_selection(X_train, X_test, y_train, y_test, num_ants, max_iter, alpha, beta, evaporation, q0)
26 # Generate ant solutions
27 for ant in range(num_ants):
---> 28 features = np.random.choice([0, 1], size=num_features, p=[1-pheromone,pheromone])
29 X_train_selected = X_train[:, features == 1]
30 X_test_selected = X_test[:, features == 1]
File mtrand.pyx:930, in numpy.random.mtrand.RandomState.choice()
ValueError: 'p' must be 1-dimensional
I suspect the issue comes from the list inside a list, because that makes p 2-dimensional instead of 1-dimensional. Using something like flatten() throws this error:
ValueError: 'a' and 'p' must have same size
How do I fix this?
The issue is that p must be a 1-D array of probabilities with the same length as a. Here a is [0, 1], so p must have exactly two entries, but you are passing [1-pheromone, pheromone], a list of two whole arrays, which NumPy cannot interpret. Without getting into the details of the algorithm, you need to use each feature's specific pheromone value as that feature's selection probability.
And if you want to generate a series of 0s and 1s with per-feature probabilities, you can either iterate over pheromone or draw all features at once with a vectorized Bernoulli sample, as in the sketch below.
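For example, a minimal sketch of the vectorized version, reusing pheromone and num_features from the function above (the max-normalization used to turn pheromone levels into probabilities is an assumption, not part of the original algorithm):
# scale pheromone into [0, 1] so each entry can serve as a probability (assumption)
probs = pheromone / pheromone.max()
# one Bernoulli draw per feature: 1 keeps the feature, 0 drops it
features = (np.random.rand(num_features) < probs).astype(int)
The equivalent per-feature loop would be features = np.array([np.random.choice([0, 1], p=[1 - p_i, p_i]) for p_i in probs]).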
I have a data frame with three categorical variables (moisture, fertilizer, type) and one numerical variable, biomass quantity. I created a regression model to predict the biomass quantity based on these variables, and I got good accuracy. The code is below:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
X = dce_stylized_fs.iloc[:, :-1].values
y = dce_stylized_fs.iloc[:, 3].values
labelencoder = LabelEncoder()
X[:, 1] = labelencoder.fit_transform(X[:, 1])
X[:, 2] = labelencoder.fit_transform(X[:, 2])
ct = ColumnTransformer([("moisture", OneHotEncoder(), [0])], remainder = 'passthrough')
X = ct.fit_transform(X)
ct = ColumnTransformer([("fertilizer", OneHotEncoder(), [1])], remainder = 'passthrough')
X = ct.fit_transform(X)
ct = ColumnTransformer([("type", OneHotEncoder(), [2])], remainder = 'passthrough')
X = ct.fit_transform(X)
X = X[:,1:] #avoid dummy variable trap
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
results = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r = r2_score(y_test, y_pred)
What I've been wondering and can't figure out is how to pass some arbitrary data to the model and see the prediction. For example, I would like to see what the model would predict if I put moisture = 10 (this is like a scale or class), fertilizer = kjx, and type = chermozher (those values already appear in the train and test data, but not in this combination). I know that I need to format those arbitrary values the same way as X_train or X_test and call predict. But because I perform one-hot encoding I get 17 columns, and I don't know which refers to which attribute; I can't see the column names because those are NumPy arrays. Can someone help me?
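A minimal sketch of one way to do this, assuming the categorical columns of dce_stylized_fs are named moisture, fertilizer, and type (the column names and the handle_unknown setting are assumptions): fit a single ColumnTransformer once, keep it, and push any new observation through the same fitted object before calling predict.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = dce_stylized_fs.iloc[:, :-1]   # keep it as a DataFrame, not .values
y = dce_stylized_fs.iloc[:, 3]

# one transformer for all three categorical columns, fitted exactly once
ct = ColumnTransformer(
    [("cats", OneHotEncoder(handle_unknown="ignore"), ["moisture", "fertilizer", "type"])],
    remainder="passthrough")
X_enc = ct.fit_transform(X)
print(ct.get_feature_names_out())  # shows which output column encodes which category

X_train, X_test, y_train, y_test = train_test_split(X_enc, y, test_size=0.2, random_state=0)
regressor = LinearRegression().fit(X_train, y_train)

# an arbitrary observation must go through the SAME fitted transformer
new_row = pd.DataFrame([{"moisture": 10, "fertilizer": "kjx", "type": "chermozher"}])
print(regressor.predict(ct.transform(new_row)))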
I am using sklearn for a multi-classification task. I need to split all the data into a train_set and a test_set, and I want to randomly take the same number of samples from each class. Currently I am using this function:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0)
but it gives an unbalanced dataset! Any suggestions?
Although Christian's suggestion is correct, technically train_test_split should give you stratified results by using the stratify param.
So you could do:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0, stratify=Target)
Note that this parameter is available starting from version 0.17 of sklearn.
From the documentation about the parameter stratify:
stratify : array-like or None (default is None)
If not None, data is split in a stratified fashion, using this as the labels array.
New in version 0.17: stratify splitting
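For reference, the cross_validation module has since been removed from scikit-learn; in current versions the same call lives in sklearn.model_selection:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    Data, Target, test_size=0.3, random_state=0, stratify=Target)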
You can use StratifiedShuffleSplit to create datasets featuring the same percentage of classes as the original one:
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[1, 3], [3, 7], [2, 4], [4, 8]])
y = np.array([0, 1, 0, 1])

stratSplit = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
for train_idx, test_idx in stratSplit.split(X, y):
    X_train = X[train_idx]
    y_train = y[train_idx]
    print(X_train)
    # [[3 7]
    #  [2 4]]
    print(y_train)
    # [1 0]
If the classes are not balanced but you want the split to be balanced, then stratifying isn't going to help. There doesn't seem to be a method for doing balanced sampling in sklearn, but it's fairly easy using basic NumPy; for example, a function like this might help you:
import numpy as np

def split_balanced(data, target, test_size=0.2):
    classes = np.unique(target)
    # test_size can be given as a fraction of the input data size or as a number of samples
    if test_size < 1:
        n_test = np.round(len(target) * test_size)
    else:
        n_test = test_size
    n_train = max(0, len(target) - n_test)
    n_train_per_class = max(1, int(np.floor(n_train / len(classes))))
    n_test_per_class = max(1, int(np.floor(n_test / len(classes))))
    ixs = []
    for cl in classes:
        if (n_train_per_class + n_test_per_class) > np.sum(target == cl):
            # if the data has too few samples for this class, do upsampling;
            # split the data into training and testing before sampling so data points
            # won't be shared among training and test data
            splitix = int(np.ceil(n_train_per_class / (n_train_per_class + n_test_per_class)
                                  * np.sum(target == cl)))
            ixs.append(np.r_[np.random.choice(np.nonzero(target == cl)[0][:splitix], n_train_per_class),
                             np.random.choice(np.nonzero(target == cl)[0][splitix:], n_test_per_class)])
        else:
            ixs.append(np.random.choice(np.nonzero(target == cl)[0],
                                        n_train_per_class + n_test_per_class, replace=False))
    # take the same number of samples from all classes
    ix_train = np.concatenate([x[:n_train_per_class] for x in ixs])
    ix_test = np.concatenate([x[n_train_per_class:(n_train_per_class + n_test_per_class)] for x in ixs])
    X_train = data[ix_train, :]
    X_test = data[ix_test, :]
    y_train = target[ix_train]
    y_test = target[ix_test]
    return X_train, X_test, y_train, y_test
Note that if you use this and sample more points per class than there are in the input data, those points will be upsampled (sampled with replacement). As a result, some data points will appear multiple times, which may affect the accuracy measures etc. And if some class has only one data point, there will be an error. You can easily check the number of points per class, for example with np.unique(target, return_counts=True).
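A quick usage sketch (data and target stand in for your feature matrix and label array):
# check how many points each class has before splitting
print(np.unique(target, return_counts=True))

X_train, X_test, y_train, y_test = split_balanced(data, target, test_size=0.2)
print(np.unique(y_train, return_counts=True))  # every class now appears equally often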
Another approach is to over- or under-sample from your stratified test/train split. The imbalanced-learn library is quite handy for this, and especially useful if you are doing online learning and want to guarantee balanced training data within your pipelines.
from imblearn.pipeline import Pipeline as ImbalancePipeline
from imblearn.over_sampling import RandomOverSampler
from sklearn.svm import SVC

model = ImbalancePipeline(steps=[
    ('data_balancer', RandomOverSampler()),
    ('classifier', SVC()),
])
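The pipeline is then fitted like any other estimator; imbalanced-learn applies the sampler only during fit, so the test data stays untouched (variable names assumed from a prior split):
model.fit(X_train, y_train)   # oversampling happens here, on the training data only
y_pred = model.predict(X_test)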
This is the implementation I use to get train/test data indexes:
import random
import numpy as np

def get_safe_balanced_split(target, trainSize=0.8, getTestIndexes=True, shuffle=False, seed=None):
    classes, counts = np.unique(target, return_counts=True)
    nPerClass = float(len(target)) * float(trainSize) / float(len(classes))
    if nPerClass > np.min(counts):
        print("Insufficient data to produce a balanced training data split.")
        print("Classes found %s" % classes)
        print("Classes count %s" % counts)
        ts = float(trainSize * np.min(counts) * len(classes)) / float(len(target))
        print("trainSize is reset from %s to %s" % (trainSize, ts))
        trainSize = ts
        nPerClass = float(len(target)) * float(trainSize) / float(len(classes))
    # number of samples to take per class
    nPerClass = int(nPerClass)
    print("Data splitting on %i classes and returning %i per class" % (len(classes), nPerClass))
    # get train indexes
    trainIndexes = []
    for c in classes:
        if seed is not None:
            np.random.seed(seed)
        cIdxs = np.where(target == c)[0]
        cIdxs = np.random.choice(cIdxs, nPerClass, replace=False)
        trainIndexes.extend(cIdxs)
    # get test indexes
    testIndexes = None
    if getTestIndexes:
        testIndexes = list(set(range(len(target))) - set(trainIndexes))
    # shuffle in place; random.shuffle returns None, so don't reassign its result
    if shuffle:
        random.shuffle(trainIndexes)
        if testIndexes is not None:
            random.shuffle(testIndexes)
    # return indexes
    return trainIndexes, testIndexes
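A usage sketch (X and y assumed to be NumPy arrays of features and labels):
trainIndexes, testIndexes = get_safe_balanced_split(y, trainSize=0.8)
X_train, y_train = X[trainIndexes], y[trainIndexes]
X_test, y_test = X[testIndexes], y[testIndexes]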
This is the function I am using. You can adapt it and optimize it.
import numpy as np

# Returns a test dataset that contains equal amounts of each class
# y should contain only two classes: 0 and 1
def TrainSplitEqualBinary(X, y, samples_n):  # samples_n per class
    indicesClass1 = []
    indicesClass2 = []
    for i in range(0, len(y)):
        if y[i] == 0 and len(indicesClass1) < samples_n:
            indicesClass1.append(i)
        elif y[i] == 1 and len(indicesClass2) < samples_n:
            indicesClass2.append(i)
        if len(indicesClass1) == samples_n and len(indicesClass2) == samples_n:
            break
    X_test_class1 = X[indicesClass1]
    X_test_class2 = X[indicesClass2]
    X_test = np.concatenate((X_test_class1, X_test_class2), axis=0)
    # remove X_test from X
    X_train = np.delete(X, indicesClass1 + indicesClass2, axis=0)
    Y_test_class1 = y[indicesClass1]
    Y_test_class2 = y[indicesClass2]
    y_test = np.concatenate((Y_test_class1, Y_test_class2), axis=0)
    # remove y_test from y
    y_train = np.delete(y, indicesClass1 + indicesClass2, axis=0)
    if X_test.shape[0] != 2 * samples_n or y_test.shape[0] != 2 * samples_n:
        raise Exception("Problem with split 1!")
    if X_train.shape[0] + X_test.shape[0] != X.shape[0] or y_train.shape[0] + y_test.shape[0] != y.shape[0]:
        raise Exception("Problem with split 2!")
    return X_train, X_test, y_train, y_test
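A usage sketch (X and y assumed to be NumPy arrays; 10 test rows are drawn per class here):
X_train, X_test, y_train, y_test = TrainSplitEqualBinary(X, y, samples_n=10)
print(np.unique(y_test, return_counts=True))  # each class contributes samples_n test rows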
I am trying to implement a simple sklearn.linear_model.LinearRegression model and evaluate its performance through MSLE.
MSLE is based on SLE = (log(prediction + 1) - log(actual + 1))^2
I have about 15 features, which are all normalized or standardized and all positive.
But when I try to do a cross validation on my training data:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
lin_reg = LinearRegression()
linreg_scores = cross_val_score(lin_reg, X_train, y_train, cv = 10, scoring = 'neg_mean_squared_log_error')
I get the following error:
ValueError: Mean Squared Logarithmic Error cannot be used when targets contain negative values.
So I checked by hand doing a manual cross validation with sklearn.model_selection.KFold, in order to print the predicted values for each fold...
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error
from sklearn.base import clone
kf = KFold(n_splits=5, shuffle=True, random_state=5)
lin_reg = LinearRegression()
split_count = 0
for train_index, val_index in kf.split(X_train, y_train):
    split_count += 1
    clone_reg = clone(lin_reg)
    X_tr = X_train.loc[train_index, :]
    X_val = X_train.loc[val_index, :]
    y_tr = y_train.loc[train_index]
    y_val = y_train.loc[val_index]
    clone_reg.fit(X_tr, y_tr)
    pred = clone_reg.predict(X_val)
    if any(pred < 0):
        print(split_count)
        print(pred[pred < 0])
The thing is, I do get negative predicted values, but they are all between [-1, 0]:
1
[-0.08642619]
3
[-0.2426673]
5
[-0.51744243]
So according to the MSLE formula, (y_predict + 1) should be positive, and thus ln(y_predict + 1) should be mathematically well-defined, as the quick check below confirms.
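Computing the formula by hand on one of the negative predictions above (the actual value here is a hypothetical stand-in):
import numpy as np

pred = -0.08642619   # negative prediction from fold 1 above
actual = 0.5         # hypothetical actual value, for illustration only
sle = (np.log1p(pred) - np.log1p(actual)) ** 2
print(sle)           # finite, since pred + 1 > 0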
Is there something that I am missing here?
Thanks a lot for your help, I'll obviously provide any additional info if needed!
I want to see the individual score of each fitted model to visualize the strength of cross validation (I am doing this to show my coworkers why cross validation is important).
I have a .csv file with 500 rows, 200 independent variables and 1 binary target. I defined skf to fold the data 5 times using StratifiedKFold.
My code looks like this:
X = data.iloc[0:500, 2:202]
y = data["target"]
skf = StratifiedKFold(n_splits = 5, random_state = 0)
clf = svm.SVC(kernel = "linear")
Scores = [0] * 5
for i, j in skf.split(X, y):
    X_train, y_train = X.iloc[i], y.iloc[i]
    X_test, y_test = X.iloc[j], y.iloc[j]
    clf.fit(X_train, y_train)
    clf.score(X_test, y_test)
As you can see, I assigned a list of 5 zeroes to Scores. I would like to assign the clf.score(X_test, y_test) of each of the 5 predictions to the list. However, the indices i and j are not {1, 2, 3, 4, 5}. Rather, they are row numbers used to fold the X and y data frames.
How can I assign the test scores of each of the k fitted models into Scores within this loop? Do I need a separate index for this?
I know using cross_val_score literally does all this and gives you a geometric average of the k scores. However, I want to show my coworkers what happens behind the cross validation functions that come in the sklearn library.
Thanks in advance!
If I understood the question, and you don't need any particular indexing for Scores:
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X = np.random.normal(size=(500, 200))
y = np.random.randint(low=0, high=2, size=500)

skf = StratifiedKFold(n_splits=5, random_state=0)
clf = SVC(kernel="linear")
Scores = []
for i, j in skf.split(X, y):
    X_train, y_train = X[i], y[i]
    X_test, y_test = X[j], y[j]
    clf.fit(X_train, y_train)
    Scores.append(clf.score(X_test, y_test))
The result is:
>>>Scores
[0.5247524752475248, 0.53, 0.5, 0.51, 0.4444444444444444]
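To make the point about fold-to-fold variability to your coworkers, you can summarize the per-fold scores afterwards:
print(np.mean(Scores), np.std(Scores))  # average accuracy and spread across folds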
I found an issue with the scikit-learn confusion matrix.
I use confusion_matrix inside KFold, and when y_true and y_pred are 100% correct, confusion_matrix returns a single number. This breaks my confusion matrix variable, because I add the result from confusion_matrix in each fold. Does anyone have a solution for this?
Here is my code
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix

model = MultinomialNB()
kf = KFold(n_splits=10)
cf = np.array([[0, 0], [0, 0]])
for train_index, test_index in kf.split(x):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    cf += confusion_matrix(y_test, y_pred)
Thank You
The cleanest way is probably to pass a list of all possible classes in as the labels argument. Here is an example that shows the issue and it being resolved (based on spoofed data for the truth and predictions).
from sklearn.metrics import confusion_matrix
import numpy as np

y_test = np.array([1, 1, 1, 1, 1, 0, 0])
y_pred = np.array([0, 1, 1, 1, 1, 0, 0])
labels = np.unique(y_test)

for indices in [[0, 1, 2, 3], [1, 2, 3], [1, 2, 3, 4, 5, 6]]:
    cm1 = confusion_matrix(y_test[indices], y_pred[indices])
    cm2 = confusion_matrix(y_test[indices], y_pred[indices], labels=labels)
    print(cm1.shape == (2, 2), cm2.shape == (2, 2))
In the first subset, both classes appear; but in the second subset, only one class appears and so the cm1 matrix is not of size (2,2) (it comes out as (1,1)). But note that by indicating all potential classes in labels, cm2 is always ok.
If you already know that the labels can only be 0 or 1, you could just assign labels=[0,1], but using np.unique will be more robust.
You can first check whether all predicted values are equal to the true values. If that is the case, just increment your [0][0] and [1][1] confusion matrix cells by the number of 0 and 1 values in the predictions (which are identical to the true values).
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix

X = pd.DataFrame({'f1': [1]*10 + [0]*10,
                  'f2': [3]*10 + [10]*10}).values
y = np.array([1]*10 + [0]*10)

model = MultinomialNB()
kf = KFold(n_splits=5)
cf = np.array([[0, 0], [0, 0]])
for train_index, test_index in kf.split(X):
    x_train, x_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    if all(y_test == y_pred):            # if perfect prediction
        cf[0][0] += sum(y_pred == 0)     # increment by the number of 0 values
        cf[1][1] += sum(y_pred == 1)     # increment by the number of 1 values
    else:
        cf += confusion_matrix(y_test, y_pred)  # otherwise add the confusion matrix values
Result of print(cf):
[[10  0]
 [ 0 10]]
Be careful about overfitting.