Why does Recursive Feature Elimination give different results each time in Python?

I am currently trying to use RFE in Python to select features for a stock prediction project. Since this is a regression task (predicting the close price for the next 10 days), I am using DecisionTreeRegressor as my model in RFE. Below is my code:
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeRegressor

def performRFE(x, y, minFeatures):
    rfecv = RFECV(
        estimator=DecisionTreeRegressor(),
        step=1,
        cv=6,
        min_features_to_select=minFeatures
    )
    rfecv.fit(x, y)
    x = rfecv.transform(x)
    # print which features got selected (cols holds the original feature names)
    rfeOutput = list(zip(cols, list(rfecv.support_)))
    for feature in rfeOutput:
        print("{} : {}".format(feature[0], feature[1]))
I want to know if I am doing this right, because for the same stock and the same set of original features, RFE gives a different result (picks a different subset of features) every time I run it. I plan to use the output of RFE in an LSTM model to predict the close price. Here x contains data such as open, close, high, low, volume, EMAs, SMAs, RSI, etc. Is it normal to get different results from RFE each time?
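For reference, one likely source of the run-to-run variation is that DecisionTreeRegressor is itself stochastic: it breaks ties between equally good splits at random, so the fitted tree, and therefore the RFE ranking, can change on every run unless the seed is pinned. A minimal sketch of the same helper with a fixed random_state (illustrative only, reusing the question's names):

from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeRegressor

def performRFE(x, y, minFeatures):
    # Pinning random_state makes the tree, and hence the RFE ranking, reproducible.
    rfecv = RFECV(
        estimator=DecisionTreeRegressor(random_state=42),
        step=1,
        cv=6,
        min_features_to_select=minFeatures,
    )
    rfecv.fit(x, y)
    return rfecv.transform(x), rfecv.support_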

Related

How to use .predict() in a Linear Regression model?

I'm trying to predict what a 15-minute delay in flight departure does to the flight's arrival time. I have thousands of rows as well as several columns in a DF. Two of these columns are dep_delay and arr_delay for departure delay and arrival delay. I have built a simple LinearRegression model:
y = nyc['dep_delay'].values.reshape((-1, 1))
arr_dep_model = LinearRegression().fit(y, nyc['arr_delay'])
And now I'm trying to find out the predicted arrival delay if the flights departure was delayed 15 minutes. How would I use the model above to predict what the arrival delay would be?
My first thought was to use a for loop / if statement, but then I came across .predict() and now I'm even more confused. Does .predict work like a boolean, where I would use "if departure delay is equal to 15, then arrival delay equals y"? Or is it something like:
arr_dep_model.predict(y)?
When working with LinearRegression models in sklearn, you perform inference with the predict() method. You also have to make sure the input you pass to it has the correct shape (the same number of columns as the training data). You can learn more about the proper use of predict in the official documentation.
arr_dep_model.predict(yourInput)
This line outputs the value the model predicts for the corresponding input. You can also place it inside a for loop to traverse a set of input values; it depends on the needs of your project and the data you are working with.
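Concretely, since the model above was fitted on a single column, a 15-minute departure delay has to be passed as a 2-D array with one row and one column. A minimal sketch reusing the names from the question:

import numpy as np

# The model was trained on a column vector, so a single query must also be 2-D: shape (1, 1).
predicted_arr_delay = arr_dep_model.predict(np.array([[15]]))
print(predicted_arr_delay[0])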
Check the code below for an example:
import pandas as pd
import random
from sklearn.linear_model import LinearRegression
df = pd.DataFrame({'x1':random.choices(range(0, 100), k=10), 'x2':random.choices(range(0, 100), k=10)})
df['y'] = df['x2'] * .5
X_train = df[['x1','x2']][:-3].values #Training on top 7 rows
y_train = df['y'][:-3].values #Training on top 7 rows
X_test = df[['x1','x2']][-3:].values # Values on which prediction will happen - bottom 3 rows
regr = LinearRegression()
regr.fit(X_train, y_train)
regr.predict(X_test)
Notice that X_test, the data on which prediction happens, has the same shape (number of columns) as X_train: both have the two columns ['x1', 'x2']. Both are converted to arrays when .values is used. You can create your own data (a 2-column dataframe in this example) and use it for prediction, since the 3rd column is the one being predicted.
The output will be three values, one prediction for each of the three test rows.

Feature selection (embedded method) showing wrong features

In feature selection (embedded method) I'm getting the wrong features.
Feature selection code:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# create the random forest model
model = RandomForestRegressor(n_estimators=120)
# fit the model to start training
model.fit(X_train[_columns], X_train['delay_in_days'])
# get the importance of the resulting features
importances = model.feature_importances_
# create a data frame for visualization
final_df = pd.DataFrame({"Features": X_train[_columns].columns, "Importances": importances})
# sort in descending order
final_df = final_df.sort_values('Importances', ascending=False)
# visualising feature importance
pd.Series(model.feature_importances_, index=X_train[_columns].columns).nlargest(10).plot(kind='barh')
_columns  # my selected features
Here is the features list; as you can see, total_open_amount is a very important feature. But when I put the top 3 features in my model I get a negative R2_Score, whereas if I remove total_open_amount from my model I get a decent R2_Score. My question is: what is causing this? (All the train and test data are randomly sampled from a dataset of size 100000.)
clf = RandomForestRegressor()
clf.fit(x_train, y_train)
# Predicting the Test Set Results
predicted = clf.predict(x_test)
This is an educated guess since you did not provide the data itself. Looking at the names of your features, the most important ones are the customer name and the total open amount. I suppose these are features with a lot of unique values.
If you check the help page for random forest, it does mention:
Warning: impurity-based feature importances can be misleading for high
cardinality features (many unique values). See
sklearn.inspection.permutation_importance as an alternative.
This is also mentioned in a publication by Strobl et al:
We show that random forest variable importance measures are a sensible
means for variable selection in many applications, but are not
reliable in situations where potential predictor variables vary in
their scale of measurement or their number of categories.
I would try permutation importance and see whether I get the same results.
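A minimal sketch of that check, assuming x_test and y_test are a held-out feature DataFrame and target from the same split used above:

from sklearn.inspection import permutation_importance

# Shuffle each feature in turn on held-out data and measure the drop in score;
# high-cardinality features no longer get an automatic advantage.
result = permutation_importance(clf, x_test, y_test, n_repeats=10, random_state=0)
for name, mean_imp in sorted(zip(x_test.columns, result.importances_mean),
                             key=lambda t: t[1], reverse=True):
    print("{}: {:.4f}".format(name, mean_imp))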

How to integrate SMOTE resampling and feature selection into RFECV

I am working on a dataset of shape (41188, 58) to build a binary classifier. The data is highly imbalanced. Initially, I intend to do feature selection with RFECV, and this is the code I am using, which has been borrowed from here:
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
import matplotlib.pyplot as plt

# Create the RFE object and compute a cross-validated score.
svc = SVC(kernel="linear")
# The "accuracy" scoring is proportional to the number of correct classifications
min_features_to_select = 1  # Minimum number of features to consider
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(5),
              scoring='accuracy',
              min_features_to_select=min_features_to_select)
rfecv.fit(X, y)
print("Optimal number of features : %d" % rfecv.n_features_)
# Plot number of features VS. cross-validation scores
plt.figure()
plt.plot(range(min_features_to_select, len(rfecv.grid_scores_) +
               min_features_to_select), rfecv.grid_scores_)
plt.show()
I got the following result:
then I changed the code to cv=StratifiedKFold(2) and min_features_to_select = 20 and this time I got:
In neither of the cases above was resampling done. Since resampling should be applied only to the training data, and I am using cross-validation here, each training fold should be resampled (e.g. with SMOTE) as well. I wonder how to integrate resampling and feature selection into RFECV?
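One common way to wire this together, sketched below under the assumption that imbalanced-learn is installed, is to put SMOTE and RFECV into an imblearn Pipeline and evaluate the whole pipeline with an outer cross-validation, so that SMOTE is fitted only on each outer training fold. Note the caveat that RFECV's internal validation folds will then still contain synthetic samples; this is a sketch, not a canonical recipe.

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold, cross_val_score

# SMOTE is applied only when the pipeline is fitted, i.e. only to training folds.
pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("rfecv", RFECV(estimator=SVC(kernel="linear"), step=1,
                    cv=StratifiedKFold(5), scoring="accuracy")),
])

# Outer cross-validation scores the resample-then-select-then-classify pipeline.
scores = cross_val_score(pipe, X, y, cv=StratifiedKFold(5), scoring="accuracy")
print(scores.mean())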

Python: statsmodels - what does .predict(X) actually predict?

I'm a bit confused as to what the line model.predict(X) actually predicts. I can't find anything on it with a Google search.
import pandas as pd
import statsmodels.api as sm
# Step 1) Load data into dataframe
df = pd.read_csv('my_data.csv')
# Step 2) Separate dependent and independent variables
X = df['independent_variable']
y = df["dependent_variable"]
# Step 3) using OLS -fit a linear regression
model = sm.OLS(y, X).fit()
predictions = model.predict(X) # make predictions
predictions
I'm not sure what predictions is showing. Is it predicting the next x rows or something? Aren't I just passing in my independent variables?
You are fitting an OLS model on your data, which is most likely interpreted as an array. The predict method returns an array of fitted values from the trained model.
In other words, from statsmodels documentation:
Return linear predicted values from a design matrix.
It is similar to sklearn. After model = sm.OLS(y, X).fit() you have a fitted model, and predictions = model.predict(X) does not predict the next x rows; it predicts from your X, the training dataset. The ordinary least squares model is a function of x, and the output is:
$$ \hat{y} = f(x) $$
If you want predictions for new data, you need to split X into training and testing datasets.
Actually, you are doing it wrong.
The predict method is used to predict values for data the model has not been fitted on.
After separating the dependent and independent variables, you can split the data into two parts, train and test:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
This makes X_train 80% of the total data, containing only the independent variables.
You can then pass X_test to the predict method and compare the result against Y_test to check how well the model is performing.
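Putting this together with the statsmodels code from the question, a minimal sketch of the split-then-predict workflow (column names reused from the question; no constant term is added, matching the original snippet):

import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

df = pd.read_csv('my_data.csv')
X = df['independent_variable']
y = df['dependent_variable']

# Hold out 20% of the rows so predict() is called on data the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = sm.OLS(y_train, X_train).fit()
predictions = model.predict(X_test)  # one fitted value per held-out row
print(predictions.head())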

How do I pass the output of SelectKBest to the cross_val_score function?

I have the following code here:
I am trying to retrieve the 20 best features from my dataset and then test the cross-validated score with the Random Forest Classifier. However, once I've performed SelectKBest I receive the outputs X_train_selected and X_test_selected, and it's not immediately obvious to me how I pass these to the cross_val_score function.
You don't need to separate train and test data for cross_val_score; the function takes care of that itself. When passing the features, you need to pass the complete feature set, not X_train and X_test.
First separate the target variable:
target = df['result']
Then run SelectKBest and get the column names like you did, but this time, instead of splitting X into train and test, just pass the whole thing as a single dataset, like this:
X = clean_df[colnames_selected]
Then pass the X and target to cross_val_score
scores = cross_val_score(forest, X, target, cv=10)
print("Reduced features: mean of the scores: {:.2f}".format(scores.mean()))
The whole point of the function is to perform cross validation on the dataset and return the scores using the estimators provided.
You can also use a Pipeline to make this whole process easier in one go, as in the sketch below.
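A minimal sketch of that pipeline variant, reusing X and target from above, with k=20 to match the question (the f_classif scoring function is an assumption here):

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# The selector is re-fitted inside every CV fold, so feature selection
# never sees the corresponding validation data.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=20)),
    ("forest", RandomForestClassifier(random_state=0)),
])

scores = cross_val_score(pipe, X, target, cv=10)
print("Pipeline features: mean of the scores: {:.2f}".format(scores.mean()))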
