I have a dataframe with 5 independent variables and 1 dependent variable. All my variables are continuous, including the dependent variable. Is there a way I can calculate which of my independent variables influences my dependent variable the most in Python? Is there an algorithm I could run to do this for me?
I tried the information gain method, but that is a classification method, so I had to use a LabelEncoder to transform my dependent variable. I used the following code after splitting my dataset into a train and test set:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Encoding the dependent variable
lab_enc = preprocessing.LabelEncoder()
training_scores_encoded = lab_enc.fit_transform(y_train)

# SelectFromModel keeps the features whose importance is greater than the mean
# importance of all features by default; this threshold can be changed.
# First I specify the random forest instance, indicating the number of trees,
# then I use the SelectFromModel object from sklearn to automatically select the features.
sel = SelectFromModel(RandomForestClassifier(n_estimators=100))
sel.fit(X_train, training_scores_encoded)

# We can now list and count the selected features.
selected_feat = X_train.columns[sel.get_support()]
len(selected_feat)

# Viewing the importances (X_train is the data used to fit the model)
importances = sel.estimator_.feature_importances_
indices = np.argsort(importances)[::-1]

plt.figure()
plt.title("Feature importances")
plt.bar(range(X_train.shape[1]), importances[indices],
        color="r", align="center")
plt.xticks(range(X_train.shape[1]), indices)
plt.xlim([-1, X_train.shape[1]])
Although I got a result, I'm not sure about it because I had to encode my (continuous) dependent variable. Is this the correct way to go? If not, what can I do?
Thank you in advance for the assistance.
You can use the SelectKBest class from the scikit-learn module.
Check the original documentation here.
This technique is called Feature Selection.
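Since your dependent variable is continuous, a regression scoring function such as f_regression is the appropriate choice, and no label encoding is needed. A minimal sketch, assuming the X_train and y_train from your code:
from sklearn.feature_selection import SelectKBest, f_regression

# Score each feature against the continuous target; keep the top 3 (pick any k you like)
selector = SelectKBest(score_func=f_regression, k=3)
selector.fit(X_train, y_train)

# The scores give a rough ranking of how strongly each feature relates to the target
for name, score in zip(X_train.columns, selector.scores_):
    print(f'{name}: {score:.2f}')

print('Selected:', list(X_train.columns[selector.get_support()]))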
You can also pick features with the highest correlation to the response.
print([(feature, abs(df[response].corr(df[feature]))) for feature in features])
This uses values from Tamarie's comment.
for feature in feature_cols:
    print(f'feature: {feature} correlation: {abs(target_v.corr(df[feature]))}')
I am trying to get the features which are important for a class and have a positive contribution (having red points on the positive side of the SHAP plot).
I can get the shap_values and plot the shap summary for each class (e.g. class 2 here) using the following code:
import shap
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values[2], X)
From the plot I can understand which features are important to that class. In the below plot, I can say alcohol and sulphates are the main features (that I am more interested in).
However, I want to automate this process, so the code can rank the features (which are important on the positive side) and return the top N. Any idea on how to automate this interpretation?
I need to automatically identify those important features for each class. Any other method besides shap that can handle this process would also be fine.
You can do the following steps, where basically we keep only the values that affect the classification positively (shap_values > 0); when shap_values < 0 the feature pushes away from the class.
Then you take the mean and sort the results.
If you prefer global importance values, use .abs() instead of [shap_df > 0],
and for the whole model use shap_values instead of shap_values['your_class_number'].
import shap
import pandas as pd

explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X)

# Build a DataFrame of SHAP values for the class you are interested in
shap_df = pd.DataFrame(shap_values['your_class_number'], columns=X.columns)

feature_importance = (shap_df
                      [shap_df > 0]          # keep only positive contributions
                      .mean()                # average contribution per feature
                      .sort_values(ascending=False)
                      .reset_index()
                      .rename(columns={'index': 'feature', 0: 'weight'})
                      .head(n)               # n = number of top features you want
                      )
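If you need this for every class at once (as in the question), one way is to wrap the same logic in a small helper. This is a sketch assuming shap_values is the usual list of per-class arrays; newer shap versions may return a single 3-D array instead, in which case index it per class accordingly:
def top_positive_features(class_shap_values, X, n=5):
    # Rank features by their mean positive SHAP contribution for one class
    shap_df = pd.DataFrame(class_shap_values, columns=X.columns)
    return (shap_df[shap_df > 0]
            .mean()
            .sort_values(ascending=False)
            .head(n))

for class_idx, class_shap in enumerate(shap_values):
    print(f'Class {class_idx}:')
    print(top_positive_features(class_shap, X, n=5))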
So I had to create a linear regression in Python, but this dataset has over 800 columns. Is there any way to see which columns contribute most to the linear regression model? Thank you.
Look at the coefficients for each of the features. Ignore the sign of the coefficient:
A large absolute value means the feature is contributing heavily.
A value close to zero means the feature is not contributing much.
A value of zero means the feature is not contributing at all.
Keep in mind that the coefficients are only directly comparable when the features are on a similar scale, so standardize the features first (or compare standardized coefficients).
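A rough sketch of that, assuming your features are in a DataFrame X and the target in y (the names are placeholders for your own data):
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Standardize so the coefficient magnitudes are comparable across columns
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

model = LinearRegression().fit(X_scaled, y)

# Rank columns by absolute coefficient size
coef_ranking = pd.Series(np.abs(model.coef_), index=X.columns).sort_values(ascending=False)
print(coef_ranking.head(20))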
You can measure the correlation between each independent variable and dependent variable, for example:
corr(X1, Y)
corr(X2, Y)
...
corr(Xn, Y)
and you can test the model by selecting the N most correlated variables.
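With pandas this can be done in one go; a sketch assuming a DataFrame df whose target column is named 'Y' (adjust the names to your data):
import pandas as pd

# Absolute correlation of every independent variable with the target
correlations = (df.drop(columns=['Y'])
                  .corrwith(df['Y'])
                  .abs()
                  .sort_values(ascending=False))

N = 10
print(correlations.head(N))  # the N most correlated variables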
There are more sophisticated methods to perform dimensionality reduction:
PCA (Principal Component Analysis)
(https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c)
Forward Feature Construction
Use XGBoost in order to measure feature importance for each variable and then select the N most important variables (see the sketch after this list)
(How to get feature importance in xgboost?)
There are many ways to perform this action and each one has pros and cons.
https://machinelearningmastery.com/dimensionality-reduction-algorithms-with-python/
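For the XGBoost option above, a minimal sketch using its scikit-learn wrapper, assuming a feature DataFrame X and target y (this is one way to pull the importances out, not the only one):
import pandas as pd
from xgboost import XGBRegressor

model = XGBRegressor(n_estimators=200, random_state=0)
model.fit(X, y)

# feature_importances_ follows the column order of X
importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)

N = 10
print(importance.head(N))  # the N most important variables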
If you are just looking for variables with high correlation, I would just do something like this:
import pandas as pd

cols = df.columns
for c in cols:
    # Set this threshold to whatever you would like
    if df['Y'].corr(df[c]) > .7:
        print(c, df['Y'].corr(df[c]))
After you have decided what threshold/columns you want, you can append c to a list.
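For example, collecting the selected column names into a list could look like this (same df and threshold as above):
selected_cols = []
for c in df.columns:
    # skip the target itself, since its correlation with itself is 1.0
    if c != 'Y' and df['Y'].corr(df[c]) > .7:
        selected_cols.append(c)

print(selected_cols)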
I am using the make_moons dataset and I am trying to implement an outlier detection algorithm. That's why I want to generate 3 points which are far away from the normal data, and test whether they are outliers or not. These 3 points should be randomly selected from my data and should be as far as possible from the normal data.
My algorithm will compare the distance of each point with a threshold value and decide whether it is an outlier or not.
I am aware of other resources for doing this, but my specific problem is my dataset; I could not find a way to fit those solutions to it.
Here is my code to define the dataset and fit it with K-Means (I have to use K-Means fitted data):
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans

data = make_moons(n_samples=100, noise=0, random_state=0)
X, y = data

n_clusters = 10
kmeans = KMeans(n_clusters=n_clusters, random_state=10)
kmeans.fit(X)
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
In short, how can I find the 3 farthest points in my data, to use in outlier detection?
As stated in the comments, you should define a criterion for classifying outliers. Either way, in the following code I randomly selected three entries from X and multiplied them by 1,000, which should surely make them outliers regardless of the definition you choose.
# Import libraries
import numpy as np
from sklearn.datasets import make_moons

# Create data
X, y = make_moons(100, random_state=123)

# Randomly select 3 row numbers from X (high is exclusive, so len(X) keeps indices in range)
np.random.seed(5)
idx = np.random.randint(low=0, high=len(X), size=3)

# Overwrite the data in the randomly selected rows
for i in idx:
    scaler = 1000  # Change this number to whatever you need
    X[i] = X[i] * scaler
Note: There is a small probability that idx will have duplicates. It won't happen with np.random.seed(5), but if you choose another seed (or opt to not use one at all) and get duplicates, simply try another one or repeat until you don't get duplicates.
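If you want to rule out duplicates entirely, one option is to draw the row indices without replacement:
# np.random.choice with replace=False guarantees three distinct row indices
idx = np.random.choice(len(X), size=3, replace=False)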
I am looking to predict whether someone is a smoker from several columns of demographic data stored in a CSV, which also contains their smoker status.
The columns used are:
Gender, Age, Race, ServedInMilitary, CountryofBirth, EducationLevel, MaritalStatus, HouseholdIncome, FamilyIncome, ChildrenInHouse, QuantitiyofAlcohol, PerUnitTime, ShortnessOfBreath, Asthma, Exercise, Smoker, SmokedBefore, AgeStartedSmoking.
All columns have numeric, but not necessarily binary, values. Could someone help me correct my code to take these factors into account when determining smoker status, and then help me measure the accuracy of my classifier?
I have the following code from a similar question: how to Load CSV Data in scikit and using it for Naive Bayes Classification
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

target_names = np.array(['Positives', 'Negatives'])

# add columns to your data frame
data['is_train'] = np.random.uniform(0, 1, len(data)) <= 0.75
# pd.Factor no longer exists; pd.Categorical.from_codes is the current equivalent
data['Type'] = pd.Categorical.from_codes(targets, target_names)
data['Targets'] = targets

# define training and test sets
train = data[data['is_train'] == True]
test = data[data['is_train'] == False]
trainTargets = np.array(train['Targets']).astype(int)
testTargets = np.array(test['Targets']).astype(int)

# columns you want to model
features = data.columns[0:7]

# call Gaussian Naive Bayes class with default parameters
gnb = GaussianNB()

# train model (and predict on the training set, as in the original snippet)
y_gnb = gnb.fit(train[features], trainTargets).predict(train[features])

# predict output on the held-out test set and measure accuracy
test_pred = gnb.predict(test[features])
print(accuracy_score(testTargets, test_pred))
There seems to be a missing line here for the dataframe, but I will assume you have it. If you don't, then read your data using pandas.read_csv.
Also, your columns seem to have data that is both categorical and numerical. For example, the "SmokedBefore" column is likely 1/0 whereas your "Age" column is likely numbers such as 20 or 30.
This makes a difference, because in "SmokedBefore" the intent is not to say that 1>0. The intent is to say Yes/No. If your model assumes that higher (or lower) is better, then this will cause an issue. Therefore it is categorical and should not be treated like a numerical value. It is simply a tag to indicate whether someone has smoked before.
However, in "Age" the intent is to say that 30 is different than 20 by 10. Therefore, it is numerical and should be treated as such.
To treat this, you will need to transform your categorical features into another set of binary features that will balance out this effect and handle it for you. This is called One Hot Encoding. Instead of 1/0 in your "SmokedBefore" column, it will become "is_1" and "is_0" with corresponding data. That way, each row gets a 1 in exactly one of those columns and a 0 in the other.
You can simply use the OneHotEncoder provided in sklearn. In recent versions you select which columns are categorical with a ColumnTransformer (older versions used the categorical_features argument for this).
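A minimal sketch of that encoding step with the current scikit-learn API, assuming your DataFrame is called data; the split into categorical and numeric columns below is only an illustrative guess, so adjust it to your actual columns:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Illustrative choice of columns; decide yourself which ones are categorical
categorical_cols = ['Gender', 'Race', 'SmokedBefore']
numeric_cols = ['Age', 'HouseholdIncome']

preprocess = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_cols)],
    remainder='passthrough'  # numeric columns are passed through unchanged
)

X_encoded = preprocess.fit_transform(data[categorical_cols + numeric_cols])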
Scikit-learn has a mechanism to rank features (classification) using extremely randomized trees.
forest = ExtraTreesClassifier(n_estimators=250,
                              random_state=0)
# (older scikit-learn versions also took compute_importances=True; recent versions
# always compute feature_importances_, so that argument is no longer needed)
My question is whether this method does a "univariate" or "multivariate" feature ranking. The univariate case is where individual features are compared to each other. I would appreciate some clarification here. Are there any other parameters I should try to fiddle with? Any experiences and pitfalls with this ranking method are also appreciated.
The output of this ranking identifies feature numbers (5, 20, 7, ...). I would like to check whether the feature number really corresponds to the row in the feature matrix; that is, whether feature number 5 corresponds to the sixth row in the feature matrix (counting starts at 0).
I'm not an expert, but this is not univariate. In fact, the total feature importance is computed from the feature importance of each tree (taking the mean value, I think).
For each tree, the importances are computed from the impurity of the split.
I used this method and it seems to give good results, better, from my point of view, than the univariate method. But I don't know of any technique to test the results other than knowledge of the dataset.
To order the features correctly, you should follow this example and modify it a bit, like so, to use a pandas.DataFrame and its proper column names:
import numpy as np
import pandas
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier

X = pandas.DataFrame(...)
Y = pandas.Series(...)

# Build a forest and compute the feature importances
forest = ExtraTreesClassifier(n_estimators=250,
                              random_state=0)
forest.fit(X, Y)

feature_importance = forest.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)[::-1]

print("Feature importance:")
for i, (f, w) in enumerate(zip(X.columns[sorted_idx], feature_importance[sorted_idx]), start=1):
    print("%d) %s : %d" % (i, f, w))

pos = np.arange(sorted_idx.shape[0]) + .5
plt.subplot(1, 2, 2)
nb_to_display = 30
plt.barh(pos[:nb_to_display], feature_importance[sorted_idx][:nb_to_display], align='center')
plt.yticks(pos[:nb_to_display], X.columns[sorted_idx][:nb_to_display])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()