Data preparation before RFECV or any other feature selection - python

I'm trying to figure out whether it is wise to remove highly correlated features (positively or negatively correlated) before feature selection. Here's a snapshot of my code:
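def find_correlation(data, threshold=0.9, remove_negative=False):
    # Pairwise correlation matrix of the feature columns
    corr_mat = data.corr()
    if remove_negative:
        # Treat strong negative correlations like strong positive ones
        corr_mat = np.abs(corr_mat)
    # Keep only the strict lower triangle so each pair is counted once
    corr_mat.loc[:, :] = np.tril(corr_mat, k=-1)
    already_in = set()
    result = []
    for col in corr_mat:
        perfect_corr = corr_mat[col][corr_mat[col] > threshold].index.tolist()
        if perfect_corr and col not in already_in:
            # NOTE: the body of this branch was lost from the post; these three
            # lines are inferred from how already_in and result are used
            already_in.update(set(perfect_corr))
            perfect_corr.append(col)
            result.append(perfect_corr)
    # Return every member of each correlated group except the first
    select_nested = [f[1:] for f in result]
    select_flat = [i for j in select_nested for i in j]
    return select_flat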
corrFeatList = find_correlation(x)
fpd = x.drop(corrFeatList,axis = 1 )
fpd['label'] = catlabel
fpd = fpd[fpd['label'].notnull()]
Features = np.array(fpd.iloc[:,:-1])
Labels = np.array(fpd.iloc[:,-1])
hpd = fpd.iloc[:,:-1]
headerName = hpd.columns
#Scale first
#Scaling normalisation
scaler = preprocessing.StandardScaler()
Features = scaler.fit_transform(Features)
#RFECV logReg first
## Reshape the Label array
Labels = Labels.reshape(Labels.shape[0],)
## Set folds for nested cross validation
feature_folds = ms.KFold(n_splits=10, shuffle = True)
## Define the model
logistic_mod = linear_model.LogisticRegression(C = 10, class_weight = "balanced")
## Perform feature selection by CV with high variance features only
selector = fs.RFECV(estimator = logistic_mod, cv = feature_folds)
selector = selector.fit(Features, Labels)
Features = selector.transform(Features)
print('Best features :', headerName[selector.support_])
I've tried it with and without dropping the correlated features and got completely different features. Do RFECV and other feature selection (dimensionality reduction) methods take these highly correlated features into account? Am I doing the right thing here? Lastly, if removing features above a high correlation threshold is a good idea, should I scale before doing this? Thank you.

RFECV simply takes your original data, cross-validates the model, and drops the least significant feature, where significance is whatever your classifier/regressor reports (for example coef_ or feature_importances_).
It then repeats the same procedure recursively on the retained features.
So it is not explicitly aware of linear correlations between features.
At the same time, a high correlation between two features does not imply that either of them is the best candidate for removal. A highly correlated feature can still carry useful information; for example, it can have less variance than the retained ones.
In the general case, dimensionality reduction does not imply removing highly correlated features, although some linear methods such as PCA do this implicitly.
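To see how RFECV treats a strongly correlated pair in practice, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the toy dataset from make_classification, the near-duplicate column appended at index 6, and the 5-fold setup. It only shows how to inspect support_ and ranking_; it makes no claim about which of the two correlated columns is kept.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy data (assumed setup): 6 features, 3 of them informative.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

# Append a near-duplicate of column 0, so columns 0 and 6 are highly correlated.
near_copy = X[:, 0] + 0.01 * rng.standard_normal(X.shape[0])
X = np.column_stack([X, near_copy])

# Scale first, as in the question, so the coefficients are comparable.
X_scaled = StandardScaler().fit_transform(X)

selector = RFECV(
    estimator=LogisticRegression(C=10, class_weight="balanced", max_iter=1000),
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
selector.fit(X_scaled, y)

corr = np.corrcoef(X_scaled[:, 0], X_scaled[:, 6])[0, 1]
print("correlation between column 0 and its copy:", round(corr, 3))
print("kept feature indices:", np.flatnonzero(selector.support_))
print("elimination ranking (1 = kept):", selector.ranking_)
```

Running this with and without the appended column is a cheap way to check how sensitive the selected subset is to correlated inputs on your own data.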


SVM classifier n_samples, n_splits problem sklearn Python

I'm trying to predict volatility one step ahead with an SVM model based on an example from the O'Reilly book Machine Learning for Financial Risk Management with Python. When I copy the example exactly (with S&P500 data) it works well, but I'm having trouble with this chunk of code on the returns data of a particular fund:
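# imports (assumed; the post did not show them)
import numpy as np
import pandas as pd
from scipy.stats import norm
from scipy.stats import uniform as sp_rand
from sklearn.svm import SVR
from sklearn.model_selection import RandomizedSearchCV
# returns (a pandas Series, since .rolling() is called on it below)
r = pd.Series([np.nan, 0.0013933, 0.00118874, 0.00076462, 0.00168565,
               -0.00018507, -0.00390753, 0.00307275, -0.00351472])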
# horizon
t = 252
# mean of returns
mu = r.mean()
# critical value
z = norm.ppf(0.95)
# realized volatility
vol = r.rolling(5).std()
vol = pd.DataFrame(vol)
vol.reset_index(drop=True, inplace=True)
r_svm = r ** 2
r_svm = r_svm.reset_index()
# inputs X (returns and realized volatility)
X = pd.concat([vol, r_svm], axis=1, ignore_index=True)
X = X.dropna().copy()
X = X.reset_index()
X.drop([1, 'index'], axis=1, inplace=True)
# labels y realized volatility shifted 1 period onward
vol = vol.dropna().reset_index()
vol.drop('index', axis=1, inplace=True)
# linear kernel
svr_lin = SVR(kernel='linear')
# hyperparameters grid
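para_grid = {'gamma': sp_rand(),
             'C': sp_rand(),
             'epsilon': sp_rand()}
# svm classifier (regression?)
clf = RandomizedSearchCV(svr_lin, para_grid)
# fit on all rows but the last; the target argument was truncated in the post and
# is reconstructed here from the "shifted 1 period onward" comment above
clf.fit(X[:-1].dropna().values, vol[1:].values.reshape(-1,))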
# prediction
n_vol = clf.predict(X.iloc[-1:])
The raised error is:
ValueError: Cannot have number of splits n_splits=5 greater than the number of samples: n_samples=3.
The code works with longer return series, so I assume the problem is the length of the array, but I can't figure out how to solve it. Can someone help me with that?
This error is raised because you use RandomizedSearchCV with the default cv parameter.
By default, RandomizedSearchCV runs 5-fold cross-validation to find the best hyperparameters for the model.
5-fold cross-validation means splitting your training data into 5 subsets and training 5 different models based on these splits.
According to the error message you have only 3 samples in your training set, so splitting the data into 5 folds isn't possible.
To fix the issue, you should either add more data or decrease the number of folds for RandomizedSearchCV by passing the cv parameter:
clf = RandomizedSearchCV(svr_lin, para_grid, cv=2)
I'd recommend collecting more data, since 4 data points most likely won't be enough to make the model accurate or predictive.
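The splitting failure can be reproduced without any model. The sketch below assumes that an integer cv on a regressor resolves to a plain KFold splitter, and uses a made-up 3-sample array to stand in for the training set from the error message.

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in training set with 3 samples, as reported by n_samples=3.
X_train = np.arange(6, dtype=float).reshape(3, 2)

# Asking for 5 folds on 3 samples fails as soon as the splits are generated.
try:
    list(KFold(n_splits=5).split(X_train))
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# With n_splits <= n_samples the split works.
folds = list(KFold(n_splits=2).split(X_train))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

With only 3 samples, each of the 2 folds trains on one or two points, which is why collecting more data is the better fix.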

Problems with inverse_transform scaled predictions and y_test in multi-step, multi-variate LSTM

I have built a multi-step, multi-variate LSTM model to predict the target variable 5 days into the future with 5 days of look-back. The model runs smoothly (even though it still needs improvement), but I cannot correctly invert the applied transformation once I get my predictions.
I have seen on the web that there are many ways to pre-process and transform data. I decided to follow these steps:
Data fetching and cleaning
df = yf.download(['^GSPC', '^GDAXI', 'CL=F', 'AAPL'], period='5y', interval='1d')['Adj Close']
df.dropna(axis=0, inplace=True)
Data set table
Split the data set into train and test
size = int(len(df) * 0.80)
df_train = df.iloc[:size]
df_test = df.iloc[size:]
Scaled train and test sets separately with MinMaxScaler()
scaler = MinMaxScaler(feature_range=(0,1))
df_train_sc = scaler.fit_transform(df_train)
df_test_sc = scaler.transform(df_test)
Creation of 3D X and y time-series compatible with the LSTM model
I borrowed the following function from this article
def create_X_Y(ts: np.array, lag=1, n_ahead=1, target_index=0) -> tuple:
    """
    A method to create X and Y matrix from a time series array for the training of
    deep learning models
    """
    # Extracting the number of features that are passed from the array
    n_features = ts.shape[1]
    # Creating placeholder lists
    X, Y = [], []
    if len(ts) - lag <= 0:
        # Series no longer than the look-back window: use it as a single sample
        # (this branch was lost from the post and is reconstructed)
        X.append(ts)
    else:
        for i in range(len(ts) - lag - n_ahead):
            Y.append(ts[(i + lag):(i + lag + n_ahead), target_index])
            X.append(ts[i:(i + lag)])
    X, Y = np.array(X), np.array(Y)
    # Reshaping the X array to an RNN input shape
    X = np.reshape(X, (X.shape[0], lag, n_features))
    return X, Y
#In this example let's assume that the first column (AAPL) is the target variable.
trainX,trainY = create_X_Y(df_train_sc,lag=5, n_ahead=5, target_index=0)
testX,testY = create_X_Y(df_test_sc,lag=5, n_ahead=5, target_index=0)
Model creation
def build_model(optimizer):
    grid_model = Sequential()
    grid_model.add(LSTM(64,activation='tanh', return_sequences=True,input_shape=(trainX.shape[1],trainX.shape[2])))
    grid_model.add(LSTM(64,activation='tanh', return_sequences=True))
    grid_model.compile(loss = 'mse',optimizer = optimizer)
    return grid_model
grid_model = KerasRegressor(build_fn=build_model,verbose=1,validation_data=(testX,testY))
parameters = {'batch_size' : [12,24],
'epochs' : [8,30],
'optimizer' : ['adam','Adadelta'] }
grid_search = GridSearchCV(estimator = grid_model,
param_grid = parameters,
cv = 3)
grid_search = grid_search.fit(trainX,trainY)
my_model = grid_search.best_estimator_.model
Get predictions
yhat = my_model.predict(testX)
Invert transformation of predictions and actual values
This is where my problems begin, because I am not sure which way to go. I have read many tutorials, but their authors seem to prefer applying MinMaxScaler() to the entire dataset before splitting it into train and test. I do not agree with this, because the training data would then be scaled using information we should not use (i.e. the test set). So I followed my own approach, but I am stuck here.
I found this possible solution on another post, but it's not working for me:
# invert scaling for forecast
pred_scaler = MinMaxScaler(feature_range=(0, 1)).fit(df_test.values[:,0].reshape(-1, 1))
inv_yhat = pred_scaler.inverse_transform(yhat)
# invert scaling for actual
inv_y = pred_scaler.inverse_transform(testY)
In fact, when I double check the last values of the target from my original data set they don't match with the inverted scaled version of the testY.
Can someone please help me on this? Many thanks in advance for your support!
Three things are worth mentioning here. First, you cannot inverse-transform with a scaler that did not perform the original transformation, and that is what happens when you use two different scalers: the NN predicts values in the range of scaler 1 (fitted on the training data), and nothing guarantees that they lie within the range of scaler 2 (fitted on the test data). Second, the best practice is to fit your scaler on the training set only and apply the same scaler (transform only) to the test data; with that same scaler you should be able to reverse the transformation of your test results. Third, if the scaling goes off because the test set has completely different values, as can happen with live streaming data, it is up to you to deal with it; for example, the min-max scaler will then produce values > 1.0.
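One practical obstacle is that the scaler was fitted on all four columns, while yhat and testY contain only the target column. A minimal sketch of inverting just that column with the training scaler's own min_ and scale_ attributes follows. The random data, the helper name invert_target, and the window construction are illustrative assumptions, not the asker's pipeline.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
data = rng.uniform(50.0, 150.0, size=(100, 4))  # stand-in for the 4 price columns
train, test = data[:80], data[80:]

scaler = MinMaxScaler(feature_range=(0, 1))
train_sc = scaler.fit_transform(train)  # fit on the training rows only
test_sc = scaler.transform(test)        # reuse the same scaler on the test rows


def invert_target(scaled, scaler, target_index=0):
    """Undo MinMaxScaler for one column: X = (X_scaled - min_) / scale_."""
    return (scaled - scaler.min_[target_index]) / scaler.scale_[target_index]


# Stand-in for testY or yhat: windows of 5 scaled target values, shape (n_windows, 5).
n_ahead = 5
windows = np.array([test_sc[i:i + n_ahead, 0]
                    for i in range(len(test_sc) - n_ahead + 1)])

recovered = invert_target(windows, scaler, target_index=0)
print(np.allclose(recovered[0], test[:n_ahead, 0]))
```

The same helper can be applied to both the predictions and the actual values, so that both are inverted with the statistics of the training set.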

Feature selection drastically decreases accuracy

I've been playing around with PySwarms, specifically with discrete.BinaryPSO, to perform feature selection, since it is an optimisation technique that performs feature subset selection to improve classifier performance.
My dataset is based on text data with a corresponding label (identified in 1's and 0's). During preprocessing, I applied CountVectorizer and TfidfTransformer to the text data.
However, a simple machine learning classifier using sklearn achieves a much higher accuracy than the same classifier combined with PySwarms. No matter which dataset, pre-processing techniques, or functions I use, my accuracy, precision and recall with discrete.BinaryPSO are lower than with a simple sklearn classification.
My code is attached below any help on the situation is appreciated:
# Create an instance of the classifier
classifier = LogisticRegression()
# Define objective function
def f_per_particle(m, alpha):
    total_features = training_data.shape[1]
    # Get the subset of the features from the binary mask
    if np.count_nonzero(m) == 0:
        X_subset = training_data
    else:
        X_subset = training_data[:,m==1]
    # Perform classification and store performance in P
    classifier.fit(X_subset, y_train)
    P = (classifier.predict(X_subset) == y_train).mean()
    # Compute for the objective function
    j = (alpha * (1.0 - P)
        + (1.0 - alpha) * (1 - (X_subset.shape[1] / total_features)))
    return j
def f(x, alpha=0.88):
    """Higher-level method to do classification in the
    whole swarm.

    Inputs
    ------
    x: numpy.ndarray of shape (n_particles, dimensions)
        The swarm that will perform the search

    Returns
    -------
    numpy.ndarray of shape (n_particles, )
        The computed loss for each particle
    """
    n_particles = x.shape[0]
    j = [f_per_particle(x[i], alpha) for i in range(n_particles)]
    return np.array(j)
options = {'c1':0.5, 'c2': 0.5,'w':0.9,'k': 10,'p':2}
# Call instance of PSO
dimensions = training_data.shape[1] # dimensions should be the number of features
optimizer = ps.discrete.BinaryPSO(n_particles=10, dimensions=dimensions, options=options)
# Perform optimization
cost, pos = optimizer.optimize(f, iters=10)
print('selected features = ' + str(sum((pos == 1)*1)) + '/' + str(len(pos)))
classifier.fit(training_data, y_train)
print('accuracy before FS = ' + str(accuracy_score(y_test, classifier.predict(testing_data), normalize = True)*100))
X_subset = training_data[:,pos==1]
classifier.fit(X_subset, y_train)
print('accuracy after FS = ' + str(accuracy_score(y_test, classifier.predict(testing_data[:,pos==1]), normalize = True)*100))
Since feature selection is not yielding better performance, I would recommend using all the features in the machine learning model and looking at the impact of each feature. You may find SHAP helpful for explaining the model output and for examining the significance of each feature.

Transform SHAP values from raw to native units with lightgbm Tweedie objective?

The utility of Shapley Additive Explanations (SHAP values) is to understand how each feature contributes to a model's prediction. For some objectives, such as regression with RMSE as an objective function, SHAP values are in the native units of the label values. For example, SHAP values could be expressed as USD if estimating housing costs. As you will see below, this is not the case for all objective functions. In particular, Tweedie regression objectives do not yield SHAP values in native units. This is a problem for interpretation, as we would want to know how housing costs are impacted by features in terms of +/- dollars.
Given this information, my question is: How do we transform the SHAP values of each individual feature into the data space of the target labels when explaining models with a Tweedie regression objective?
I'm not aware of any package that currently implements such a transformation. This remains unresolved in the package put out by the shap authors themselves.
I illustrate the finer points of this question with the R implementation of lightgbm in the following:
tweedie_variance_power <- 1.2
labels <- rtweedie(1000, mu = 1, phi = 1, power = tweedie_variance_power)
feat1 <- labels + rnorm(1000) #good signal for label with some noise
feat2 <-rnorm(1000) #garbage feature
feat3 <-rnorm(1000) #garbage feature
features <- cbind(feat1, feat2, feat3)
dTrain <- lgb.Dataset(data = features,
label = labels)
params <- c(objective = 'tweedie',
tweedie_variance_power = tweedie_variance_power)
mod <- lgb.train(data = dTrain,
params = params,
nrounds = 100)
#Predictions in the native units of the labels
predsNative <- predict(mod, features, rawscore = FALSE)
#Predictions in the raw format
predsRaw <- predict(mod, features, rawscore = TRUE)
#We do not expect these values to be equal
all.equal(predsNative, predsRaw)
"Mean relative difference: 1.503072"
#We expect values to be equal if raw scores are exponentiated
all.equal(predsNative, exp(predsRaw))
"TRUE" #... our expectations are correct
#SHAP values
shapNative <- predict(mod, features, rawscore = FALSE, predcontrib = TRUE)
shapRaw <- predict(mod, features, rawscore = TRUE, predcontrib = TRUE )
#Are there differences between shap values when rawscore is TRUE or FALSE?
all.equal(shapNative, shapRaw)
"TRUE" #outputs are identical, that is surprising!
#So are the shap values in raw or native formats?
#To answer this question we can sum them
#testing the raw case first
all.equal(rowSums(shapRaw), predsRaw)
#from this we can conclude that shap values are not in native units,
#regardless of whether rawscore is TRUE or FALSE
#Test native scores just to prove point
all.equal(rowSums(shapNative), predsNative)
"Mean relative difference: 1.636892" # reaffirms that shap values are not in native units
#However, we can perform this operation on the raw shap scores
#to get the prediction in the native value
all.equal(exp(rowSums(shapRaw)), predsNative)
#reversing the operations does not yield the same result
all.equal(rowSums(exp(shapRaw)), predsNative)
"Mean relative difference: 0.7662481"
#The last line is relevant because it implies
#The relationship between native predictions
#and exponentiated shap values is not linear
#So, given the point of SHAP is to understand how each
#feature impacts the prediction in its native units
#the raw shap values are not as useful as they could be
#Thus, how would we convert
#each of these four raw shap value elements to native units,
#thus understanding their contributions to their predictions
#in currency of native units?
-0.15429227 0.04858757 -0.27715359 -0.48454457
My understanding of SHAP values is that they are in the native units of the labels/response when conducting regression, and that the sum of the SHAP values approximates the model's prediction.
I am trying to extract SHAP values in LightGBM package, with a Tweedie regression objective, but find that the SHAP values are not in the native units of the labels and that they do not sum to predicted values.
It appears that they must be exponentiated, is this correct?
Side note: I understand that the final column of the SHAP values matrix represents the base prediction, and must be added.
Reproducible example:
tweedie_variance_power <- 1.2
labels <- rtweedie(1000, mu = 1, phi = 1, power = tweedie_variance_power)
feat1 <- labels + rnorm(1000) #good signal for label with some noise
feat2 <-rnorm(1000) #garbage feature
feat3 <-rnorm(1000) #garbage feature
features <- cbind(feat1, feat2, feat3)
dTrain <- lgb.Dataset(data = features,
label = labels)
params <- c(objective = 'tweedie',
tweedie_variance_power = tweedie_variance_power)
mod <- lgb.train(data = dTrain,
params = params,
nrounds = 100)
preds <- predict(mod, features)
plot(preds, labels,
main = paste('RMSE =',
RMSE(pred = preds, obs = labels)))
#shap values are summing to negative values?
shap_vals <- predict(mod, features, predcontrib = TRUE, rawscore = FALSE)
shaps_sum <- rowSums(shap_vals)
plot(shaps_sum, labels,
main = paste('RMSE =',
RMSE(pred = shaps_sum, obs = labels)))
#maybe we need to exponentiate?
shap_vals_exp <- exp(shap_vals)
shap_vals_exp_sum <- rowSums(shap_vals_exp)
#still looks a little weird, overpredicting
plot(shap_vals_exp_sum, labels,
main = paste('RMSE =',
RMSE(pred = shap_vals_exp_sum, obs = labels)))
The order of operations is to sum the SHAP values first and then exponentiate, which gives you the predictions in native units. However, I am still unclear on how to transform the feature-level values to the native response units.
shap_vals_sum_exp <- exp(shaps_sum)
plot(shap_vals_sum_exp, labels,
main = paste('RMSE =',
RMSE(pred = shap_vals_sum_exp, obs = labels)))
I will show how to reconcile shap values and model predictions in Python, both in raw scores and original units. Hopefully it will help you understand where you are in R.
Step 1. Generate dataset
# pip install tweedie
import numpy as np
import tweedie
y = tweedie.tweedie(1.2,1,1).rvs(size=1000)
X = np.random.randn(1000,3)
Step 2. Fit model
from lightgbm.sklearn import LGBMRegressor
lgb = LGBMRegressor(objective = 'tweedie')
lgb.fit(X,y)
Step 3. Understand what shap values are.
Shap values for 0th data point
shap_values = lgb.predict(X, pred_contrib=True)
array([ 0.36841812, -0.15985678, 0.28910617, -0.27317984])
The first 3 are model contributions to baseline, i.e. shap values themselves:
The 4th is baseline in raw scores:
Sum of them add up to model prediction in raw scores:
shap_values[0,:3].sum() + shap_values[0,3]
Let's check against raw model predictions:
preds = lgb.predict(X, raw_score=True)
EDIT. Conversion between raw scores and original units
To convert between raw scores and original units for the Tweedie (and also the Poisson and Gamma) distribution, you need to be aware of 2 facts:
Original is exp of raw
exp of sum is product of exps
0th prediction in original units:
Shap values for 0th row in raw score space:
shap_values = lgb.predict(X, pred_contrib=True, raw_score=True)
array([-0.77194274, -0.08343294, 0.22740536, -0.30358374])
Conversion of shap values to original units (product of exponents):
np.prod(np.exp(shap_values[0]))
Looks similar to me again.
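The two facts can be checked numerically on the row printed above. This is only a sketch with NumPy and the four raw-score values copied from that output. Reading each exponentiated contribution as a multiplicative factor on the baseline is an interpretation that follows from the log link, not something LightGBM reports itself.

```python
import numpy as np

# Raw-score contributions for row 0 as printed above: three features, then the baseline.
row = np.array([-0.77194274, -0.08343294, 0.22740536, -0.30358374])

raw_prediction = row.sum()                # prediction in raw-score space
native_prediction = np.exp(raw_prediction)  # prediction in original units

# exp of a sum equals the product of the exps: one multiplicative factor per term.
factors = np.exp(row)
print(np.isclose(native_prediction, factors.prod()))

# Summing the exponentiated terms is a different quantity altogether.
print(np.isclose(native_prediction, factors.sum()))

# Interpretation (assumption): each feature rescales the baseline by its factor.
baseline_native = factors[-1]
percent_change = (factors[:-1] - 1.0) * 100
print("baseline in original units:", baseline_native)
print("percent change per feature:", percent_change)
```

This is why the per-feature values do not convert into additive amounts in original units: in original units the contributions combine by multiplication, so each feature's effect depends on the baseline and on the other features.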

python: How to get real feature name from feature_importances

I am using Python's sklearn random forest (ensemble.RandomForestClassifier) to do classification and am using feature_importances_ to find significant feature for the classifier. Now my code is:
for trip in database:
# Counter(trip['POI']) is like Counter({'school':1, 'hospital':1, 'bus station':2}),actually key is the feature
feat_loc_vectorizer = DictVectorizer()
feat_loc_orig_mat = feat_loc_vectorizer.transform(venue_feature_start)
orig_tfidf = TfidfTransformer()
orig_ven_feat = orig_tfidf.fit_transform(feat_loc_orig_mat.tocsr())
# so DictVectorizer() and TfidfTransformer() help me to phrase the features and for each instance, the feature dimension is 580, which means that there are 580 venue types
data = orig_ven_feat.tocsr()
le = LabelEncoder()
labels = le.fit_transform(labels_raw)
if "Unlabelled" in labels_raw:
    unlabelled_int = int(le.transform(["Unlabelled"]))
else:
    unlabelled_int = -1
valid_rows_idx = np.where(labels!=unlabelled_int)[0]
labels = labels[valid_rows_idx]
user_ids = np.asarray(user_ids_raw)
# user_ids is for cross validation, labels is for classification
clf = ensemble.RandomForestClassifier(n_estimators = 50)
cv_indices = LeavePUsersOut(user_ids[valid_rows_idx], n_folds = 10)
data = data[valid_rows_idx,:].toarray()
for train_ind, test_ind in cv_indices:
    train_data = data[train_ind,:]
    test_data = data[test_ind,:]
    labels_train = labels[train_ind]
    labels_test = labels[test_ind]
    print ("Training classifier...")
    clf.fit(train_data,labels_train)
    importances = clf.feature_importances_
Now the problem is that I get an array of dimension 580 (the same as the feature dimension) when I use feature_importances_, and I want to know the top 20 important features (the top 20 important venues).
I think the least I need to know is the indices of the 20 largest values in importances, but I don't know:
How to get the indices of the top 20 from importances
How to match the indices with the real venue names ('school', 'home', ...), since I used DictVectorizer and TfidfTransformer
Any idea to help me? Thank you very much!
To get the importance for each feature name, just iterate through the column names and feature_importances_ together (they map to each other; df is assumed here to be the DataFrame the model was trained on):
for feat, importance in zip(df.columns, clf.feature_importances_):
    print('feature: {f}, importance: {i}'.format(f=feat, i=importance))
The feature_importances_ attribute holds the relative importance values in the order in which the features were fed to the algorithm. So to get the top 20 features you'll want to sort the features by importance, for instance like this:
importances = clf.feature_importances_
indices = numpy.argsort(importances)[-20:]
([-20:] because argsort sorts in ascending order, so the 20 most important features are the last 20 elements of the array)
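For the DictVectorizer pipeline in the question there is no DataFrame, but the fitted vectorizer keeps the names in its feature_names_ attribute, in column order. Here is a minimal sketch on a made-up set of trips. The venue counts, the labels, and top_n = 3 are hypothetical; TfidfTransformer is assumed to leave the column order unchanged, which it does since it only reweights values.

```python
from collections import Counter

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical venue counts per trip, standing in for Counter(trip['POI']).
trips = [
    Counter({"school": 1, "hospital": 1, "bus station": 2}),
    Counter({"home": 3, "school": 1}),
    Counter({"hospital": 2, "cafe": 1}),
    Counter({"home": 1, "bus station": 1, "cafe": 2}),
]
labels = np.array([0, 1, 0, 1])

vectorizer = DictVectorizer()
counts = vectorizer.fit_transform(trips)
data = TfidfTransformer().fit_transform(counts).toarray()

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(data, labels)

# Column i of the matrix corresponds to feature_names_[i].
feature_names = np.array(vectorizer.feature_names_)

# Indices of the top_n largest importances, most important first.
top_n = 3
top_idx = np.argsort(clf.feature_importances_)[::-1][:top_n]
for name, score in zip(feature_names[top_idx], clf.feature_importances_[top_idx]):
    print(f"{name}: {score:.3f}")
```

With the real data, top_n would be 20 and feature_names would hold the 580 venue types.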
