Feature selection drastically decreases accuracy - Python

I've been playing around with PySwarms, specifically with discrete.BinaryPSO, to perform feature selection, since it is an optimisation technique for selecting feature subsets to improve classifier performance (https://pyswarms.readthedocs.io/en/development/examples/feature_subset_selection.html <- link to PySwarms).
My dataset consists of text data with a corresponding label (1s and 0s). During preprocessing I applied CountVectorizer and TfidfTransformer to the text data.
However, a simple machine learning classifier from sklearn achieves much higher accuracy than the pipeline that incorporates PySwarms. No matter which dataset, pre-processing techniques, or functions I add when incorporating discrete.BinaryPSO, my accuracy, precision and recall are lower than with a plain sklearn classifier.
My code is attached below; any help with the situation is appreciated:
import numpy as np
import pyswarms as ps
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create an instance of the classifier
classifier = LogisticRegression()

# Define objective function
def f_per_particle(m, alpha):
    total_features = training_data.shape[1]
    # Get the subset of the features from the binary mask
    if np.count_nonzero(m) == 0:
        X_subset = training_data
    else:
        X_subset = training_data[:, m == 1]
    # Perform classification and store performance in P
    classifier.fit(X_subset, y_train)
    P = (classifier.predict(X_subset) == y_train).mean()
    # Compute for the objective function
    j = (alpha * (1.0 - P)
         + (1.0 - alpha) * (1 - (X_subset.shape[1] / total_features)))
    return j

def f(x, alpha=0.88):
    """Higher-level method to do classification in the
    whole swarm.
    Inputs
    ------
    x: numpy.ndarray of shape (n_particles, dimensions)
        The swarm that will perform the search
    Returns
    -------
    numpy.ndarray of shape (n_particles, )
        The computed loss for each particle
    """
    n_particles = x.shape[0]
    j = [f_per_particle(x[i], alpha) for i in range(n_particles)]
    return np.array(j)

options = {'c1': 0.5, 'c2': 0.5, 'w': 0.9, 'k': 10, 'p': 2}

# Call instance of PSO
dimensions = training_data.shape[1]  # dimensions should be the number of features
optimizer = ps.discrete.BinaryPSO(n_particles=10, dimensions=dimensions, options=options)

# Perform optimization
cost, pos = optimizer.optimize(f, iters=10)
print('selected features = ' + str(sum((pos == 1) * 1)) + '/' + str(len(pos)))

classifier.fit(training_data, y_train)
print('accuracy before FS = ' + str(accuracy_score(y_test, classifier.predict(testing_data), normalize=True) * 100))

X_subset = training_data[:, pos == 1]
classifier.fit(X_subset, y_train)
print('accuracy after FS = ' + str(accuracy_score(y_test, classifier.predict(testing_data[:, pos == 1]), normalize=True) * 100))

Since feature selection is not yielding better performance, I would recommend using all the features in the machine learning model and looking at the impact of each feature. You may find SHAP (https://shap.readthedocs.io/en/latest/index.html) helpful for explaining the output and then examining the significance of each feature for this purpose.
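Not from the original answer, but as a rough illustration of how that might look: a minimal sketch, assuming a fitted linear classifier (here a stand-in LogisticRegression on made-up dense data; with real TF-IDF output you would densify the matrix or pick an appropriate explainer):

import shap
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins for your own data: X_train is a dense 2-D array
# of TF-IDF features, y_train the 0/1 labels.
X_train = np.random.rand(100, 20)
y_train = np.random.randint(0, 2, size=100)

model = LogisticRegression().fit(X_train, y_train)

# Explain the model's output with SHAP values
explainer = shap.LinearExplainer(model, X_train)
shap_values = explainer.shap_values(X_train)

# Mean absolute SHAP value per feature gives a rough importance ranking
importance = np.abs(shap_values).mean(axis=0)
print(np.argsort(importance)[::-1][:10])  # indices of the 10 most influential features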

Related

Optimizing conditional multiclass softmax objective function in XGBoost

I have successfully implemented a custom multiclass softmax function in XGBoost based on this tutorial. The reason for customization is that the classes I want to predict are conditional on some data inputs - i.e. of the 24 possible classes being predicted, only a certain subset are valid. valid_transitions are lists of indices corresponding to classes we want to make predictions on, and invalid_transitions are the inverse set of indices.
I have implemented .fit() and .predict_proba() such that they take valid_transitions and invalid_transitions as arguments, which tell softprob_obj() and softmax() which classes to null out during training and prediction.
import numpy as np

def softmax(x, valid_transitions, invalid_transitions):
    for i in range(len(x)):
        e = np.exp(x[i, valid_transitions[i]])
        x[i, valid_transitions[i]] = e / np.sum(e)
        x[i, invalid_transitions[i]] = 0
    return x

def softprob_obj(labels, predt, data, valid_transitions, invalid_transitions):
    '''Loss function. Computing the gradient and approximated hessian (diagonal).
    Reimplements the `multi:softprob` inside XGBoost.
    '''
    kRows = len(data)
    kClasses = len(np.unique(labels))
    # The prediction is of shape (rows, classes), each element in a row
    # represents a raw prediction (leaf weight, hasn't gone through softmax
    # yet). In XGBoost 1.0.0, the prediction is transformed by a softmax
    # function, fixed in later versions.
    assert predt.shape == (kRows, kClasses)
    eps = 1e-6
    # compute the gradient and hessian, slow iterations in Python, only
    # suitable for demo. Also the one in native XGBoost core is more robust to
    # numeric overflow as we don't do anything to mitigate the `exp` in
    # `softmax` here.
    probs = softmax(predt, valid_transitions, invalid_transitions)
    labels = labels.astype(int)
    hess = np.maximum((2.0 * probs * (1.0 - probs)), eps)
    probs[np.arange(len(probs)), labels] -= 1
    # Right now (XGBoost 1.0.0), reshaping is necessary
    grad = probs.reshape((kRows * kClasses, 1))
    hess = hess.reshape((kRows * kClasses, 1))
    return grad, hess
This works, but is pretty slow in training, presumably because the core XGBoost functions I'm replacing are not written in Python. I made some attempts to vectorize the calculation in numpy to avoid the for loop in softmax(), but ran into issues with the ragged arrays that valid_transitions and invalid_transitions create. Was wondering if anyone had any ideas on how to optimize this within Python. Thanks!
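One possible direction (not from the original thread): convert the ragged valid_transitions lists into a dense boolean mask of shape (rows, classes), so the per-row loop becomes masked array operations. A minimal sketch, with made-up mask construction and names:

import numpy as np

def softmax_masked(x, valid_mask):
    """Vectorised masked softmax.

    x          : raw scores, shape (rows, classes)
    valid_mask : boolean array, shape (rows, classes); True where the class
                 is a valid transition for that row.
    """
    # Send invalid classes to -inf so exp() maps them to 0
    z = np.where(valid_mask, x, -np.inf)
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Building the mask once from the original ragged lists (hypothetical example)
rows, classes = 4, 24
valid_transitions = [[0, 1], [2, 3, 4], [5], list(range(24))]
valid_mask = np.zeros((rows, classes), dtype=bool)
for i, idx in enumerate(valid_transitions):
    valid_mask[i, idx] = True

scores = np.random.randn(rows, classes)
probs = softmax_masked(scores, valid_mask)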

Why can't I get the result I got with the sklearn LogisticRegression with the coefficients_sgd method?

from math import exp
import numpy as np
from sklearn.linear_model import LogisticRegression
I used the code below from How To Implement Logistic Regression From Scratch in Python.
def predict(row, coefficients):
    yhat = coefficients[0]
    for i in range(len(row) - 1):
        yhat += coefficients[i + 1] * row[i]
    return 1.0 / (1.0 + exp(-yhat))

def coefficients_sgd(train, l_rate, n_epoch):
    coef = [0.0 for i in range(len(train[0]))]
    for epoch in range(n_epoch):
        sum_error = 0
        for row in train:
            yhat = predict(row, coef)
            error = row[-1] - yhat
            sum_error += error ** 2
            coef[0] = coef[0] + l_rate * error * yhat * (1.0 - yhat)
            for i in range(len(row) - 1):
                coef[i + 1] = coef[i + 1] + l_rate * error * yhat * (1.0 - yhat) * row[i]
    return coef

dataset = [[2.7810836, 2.550537003, 0],
           [1.465489372, 2.362125076, 0],
           [3.396561688, 4.400293529, 0],
           [1.38807019, 1.850220317, 0],
           [3.06407232, 3.005305973, 0],
           [7.627531214, 2.759262235, 1],
           [5.332441248, 2.088626775, 1],
           [6.922596716, 1.77106367, 1],
           [8.675418651, -0.242068655, 1],
           [7.673756466, 3.508563011, 1]]
l_rate = 0.3
n_epoch = 100
coef = coefficients_sgd(dataset, l_rate, n_epoch)
print(coef)
[-0.39233141593823756, 1.4791536027917747, -2.316697087065274]
x = np.array(dataset)[:,:2]
y = np.array(dataset)[:,2]
model = LogisticRegression(penalty="none")
model.fit(x,y)
print(model.intercept_.tolist() + model.coef_.ravel().tolist())
[-3.233238244349982, 6.374828107647225, -9.631487530388092]
What should I change to get the same or closer coefficients? How can I establish the initial coefficients, learning rate, and n_epoch?
Well, there are many nuances here 🙂
First, recall that estimating the coefficients of logistic regression by (negative) log-likelihood is possible using various optimization methods, including the SGD you implemented, but there is no exact, closed-form solution. So even if you implement an exact copy of scikit-learn's LogisticRegression, you will need to set the same hyperparameters (number of epochs, learning rate, etc.) and random state to obtain the same coefficients.
Second, LogisticRegression offers five different optimization methods (the solver parameter). You run LogisticRegression(penalty="none") with its default parameters, and the default solver is 'lbfgs', not SGD; so depending on your data and hyperparameters, you may get significantly different results.
What should I change to get the same or closer coefficients ?
I would suggest comparing your implementation with SGDClassifier(loss='log') first, since LogisticRegression does not offer an SGD solver. Keep in mind, though, that scikit-learn's implementation is more sophisticated, in particular offering more hyperparameters for early stopping, like tol.
How can I establish initial coefficients, learning rate, n_epoch?
Typically, coefficients for SGD are initialized randomly (e.g., uniform(-1/(2n), 1/(2n))), using some data statistics (e.g., dot(y, w)/dot(w, w) for every coefficient w), or with a pre-trained model's parameters. In contrast, there is no golden rule for the learning rate or the number of epochs. Usually, we set a large number of epochs plus some other stopping criterion (e.g., whether the norm of the difference between the current and previous coefficients is smaller than some small tol), pick a moderate learning rate, and at every iteration reduce the learning rate following some rule (see the learning_rate parameter of SGDClassifier or the User Guide) and check the stopping criterion.
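A minimal sketch of that comparison, assuming the same toy dataset as above; the hyperparameter values are only illustrative (and in newer scikit-learn versions the loss is called 'log_loss'):

import numpy as np
from sklearn.linear_model import SGDClassifier

x = np.array(dataset)[:, :2]
y = np.array(dataset)[:, 2]

# loss='log' gives logistic regression trained by SGD;
# alpha=0.0 disables regularization, to match the from-scratch version
sgd = SGDClassifier(loss='log', alpha=0.0, learning_rate='constant',
                    eta0=0.3, max_iter=100, tol=None, random_state=0)
sgd.fit(x, y)
print(sgd.intercept_.tolist() + sgd.coef_.ravel().tolist())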

Tensorflow model architecture for sparse dataset

I have a regression dataset where approximately 95% of the target values are zero (the other 5% are between 1 and 30), and I am trying to design a Tensorflow model for that data. I am thinking of implementing a model that combines a classifier and a regressor (check the output of the classifier sub-model; if it's less than a threshold, pass it to the regression sub-model). I have the intuition that this should be built using the functional API, but I couldn't find helpful resources on that. Any ideas?
Here is the code that generates the data that I am using to replicate the problem:
import numpy as np

n = 10000
zero_percentage = 0.95
zeros = np.zeros(round(n * zero_percentage))
non_zeros = np.random.randint(1,30,size=round(n * (1- zero_percentage)))
y = np.concatenate((zeros,non_zeros))
np.random.shuffle(y)
a = 50
b = 10
x = np.array([np.random.randint(31,60) if element == 0 else (element - b) / a for element in y])
y_classification = np.array([0 if element == 0 else 1 for element in y])
Note: I experimented with probabilistic models (Poisson regression and regression with a discretized logistic mixture distribution), and they provided good results, but training was unstable (the loss diverges very often).
Instead of trying to find some heuristic to balance the training between the zero values and the others, you might want to try an input preprocessing method that handles imbalanced training sets better (usually by mapping to another space before running the model, then applying the inverse to the results), for example an embedding layer. Alternatively, normalize the values to a small range (like [-1, 1]) and apply an activation function before evaluating the model on the data.
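A rough sketch of the second suggestion, assuming a plain dense Keras regressor (the layer sizes and the MinMaxScaler choice are illustrative, not prescribed by the answer): scale the targets to [-1, 1], use a tanh output so the network works in that range, then invert the scaling on the predictions.

import numpy as np
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler

# x, y come from the data-generation snippet above
scaler = MinMaxScaler(feature_range=(-1, 1))
y_scaled = scaler.fit_transform(y.reshape(-1, 1))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(1,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='tanh'),  # output constrained to [-1, 1]
])
model.compile(optimizer='adam', loss='mse')
model.fit(x.reshape(-1, 1), y_scaled, epochs=10, batch_size=32)

# Map predictions back to the original target scale
y_pred = scaler.inverse_transform(model.predict(x.reshape(-1, 1)))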

Data preparation before RFECV or any other feature selection

I'm trying to figure out whether it is wise to remove highly correlated and negatively correlated features before feature selection. Here's a snapshot of my code:
# Imports inferred from the aliases used below
import numpy as np
import numpy.random as nr
from sklearn import preprocessing, linear_model
import sklearn.model_selection as ms
import sklearn.feature_selection as fs

def find_correlation(data, threshold=0.9, remove_negative=False):
    corr_mat = data.corr()
    if remove_negative:
        corr_mat = np.abs(corr_mat)
    corr_mat.loc[:, :] = np.tril(corr_mat, k=-1)
    already_in = set()
    result = []
    for col in corr_mat:
        perfect_corr = corr_mat[col][corr_mat[col] > threshold].index.tolist()
        if perfect_corr and col not in already_in:
            already_in.update(set(perfect_corr))
            perfect_corr.append(col)
            result.append(perfect_corr)
    select_nested = [f[1:] for f in result]
    select_flat = [i for j in select_nested for i in j]
    return select_flat

corrFeatList = find_correlation(x)
fpd = x.drop(corrFeatList, axis=1)
fpd['label'] = catlabel
fpd = fpd[fpd['label'].notnull()]
Features = np.array(fpd.iloc[:, :-1])
Labels = np.array(fpd.iloc[:, -1])
hpd = fpd.iloc[:, :-1]
headerName = hpd.columns

# Scale first (standardisation)
scaler = preprocessing.StandardScaler()
Features = scaler.fit_transform(Features)

# RFECV with logistic regression
## Reshape the Label array
Labels = Labels.reshape(Labels.shape[0],)
## Set folds for nested cross validation
nr.seed(988)
feature_folds = ms.KFold(n_splits=10, shuffle=True)
## Define the model
logistic_mod = linear_model.LogisticRegression(C=10, class_weight="balanced")
## Perform feature selection by CV with high variance features only
nr.seed(6677)
selector = fs.RFECV(estimator=logistic_mod, cv=feature_folds)
selector = selector.fit(Features, Labels)
Features = selector.transform(Features)
print('Best features :', headerName[selector.support_])
So I've tried it with and without dropping the correlated features and have gotten completely different selected features. Do RFECV and other feature selection (dimensionality reduction) methods take these highly correlated features into account? Am I doing the right thing here? Lastly, if removing highly correlated features is a good idea, should I scale before doing so? Thank you.
Kevin
RFECV simply takes your original data, cross-validates the model, and drops the least significant feature, with significance provided by your classifier/regressor.
Then it does the same with all the retained features, recursively.
So it is not explicitly aware of linear correlations.
At the same time, high correlation between features does not imply that one of them is the best candidate for removal. A highly correlated feature can still carry useful information; for example, it can have less variance than the retained ones.
Dimensionality reduction does not imply removal of highly correlated features in the general case, although some linear methods like PCA implicitly do this.
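Not part of the original answer, but a small sketch of how to see this behaviour: with a deliberately duplicated (perfectly correlated) column, you can inspect which features RFECV keeps and in what order it eliminates them (the data and column names below are made up):

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFECV

rng = np.random.default_rng(0)
n = 200
informative = rng.normal(size=n)
df = pd.DataFrame({
    'informative': informative,
    'duplicate': informative * 2.0,   # perfectly correlated with 'informative'
    'noise': rng.normal(size=n),
})
y = (informative + 0.1 * rng.normal(size=n) > 0).astype(int)

selector = RFECV(LogisticRegression(), cv=5).fit(df, y)
print(dict(zip(df.columns, selector.support_)))   # which features survive
print(dict(zip(df.columns, selector.ranking_)))   # elimination order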

Python Logistic regression and future samples

I'm learning logistic regression in Python and have managed to fit a model to my existing data (stock market data), and the predictions produce a nice result.
But I do not know how to convert that predictive model so I can apply it to future data. I.e., is there a y = ax + b equation I can use to input future samples? How do I use the 'model'? How does one use the prediction for subsequent data? Or am I off track here - is logistic regression not applied in this manner?
When you train the logistic regression, you learn the parameters a and b in y = ax + b. So, after training, a and b are known and can be used to evaluate y = ax + b for new samples.
I don't know what exact Python packages you used to train your model and how many classes you have, but if it were, let's say, numpy and 2 classes, the prediction function could look like this:
import numpy as np

def sigmoid(z):
    """
    Compute the sigmoid of z.
    Arguments:
    z: a scalar or numpy array
    Return:
    s: sigmoid(z)
    """
    s = 1 / (1 + np.exp(-z))
    return s

def predict(w, b, X):
    '''
    Predict whether the label is 0 or 1 using learned logistic
    regression parameters (w, b).
    Arguments:
    w: weights
    b: bias
    X: data to predict
    Returns:
    Y_pred: a numpy array (vector) containing all predictions (0/1)
            for the examples in X
    '''
    m = X.shape[1]  # number of instances in X
    Y_pred = np.zeros((1, m))
    w = w.reshape(X.shape[0], 1)
    # Apply the same activation function which you applied during
    # training, in this case it is a sigmoid
    A = sigmoid(np.dot(w.T, X) + b)
    for i in range(A.shape[1]):
        # Convert probabilities A[0,i] to actual predictions p[0,i]
        if A[0, i] > 0.5:
            Y_pred[0, i] = 1
        else:
            Y_pred[0, i] = 0
    return Y_pred
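A quick usage example (the weights, bias, and future samples below are made-up placeholders; in practice w and b would come from your trained model, e.g. model.coef_ and model.intercept_ in scikit-learn):

import numpy as np

# Hypothetical learned parameters for a model with 2 features
w = np.array([[0.8], [-1.2]])   # shape (n_features, 1)
b = 0.5

# Three "future" samples, one per column: shape (n_features, n_samples)
X_future = np.array([[0.2, 1.5, -0.3],
                     [1.0, -0.4, 0.7]])

print(predict(w, b, X_future))  # prints [[0. 1. 0.]]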
