I am working on a classification problem. I have around 1000 features, and the target variable has 2 classes. All 1000 features take values of 1 or 0. I am trying to find feature importance, but my feature importance values vary from 0.0 to 0.003. I am not sure whether such low values are meaningful.
Is there a way I can increase feature importance?
# Variable importance
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(min_samples_split=10, random_state=1)
rf.fit(X, Y)
print("Features sorted by their score:")
a = sorted(zip(map(lambda x: round(x, 3), rf.feature_importances_), X), reverse=True)
I would really appreciate any help! Thanks
Since you only have two target classes, you can perform an unequal-variance t-test, which has been useful for finding important features in binary classification tasks when all other feature-ranking methods have failed me. You can implement this using the scipy.stats.ttest_ind function. It is a statistical test that checks whether two distributions are different: if the returned p-value is less than 0.05, they can be assumed to be different distributions. To implement it for each feature, follow these steps:
Extract all predictor values for class 1 and 2 respectively.
Run ttest_ind on these two distributions, specifying equal_var=False (their variances are assumed unequal), and make sure it's a two-tailed t-test.
If the p-value is less than 0.05, this feature is important.
Alternatively, you can do this for all your features and use the p-value as the measure of feature importance: the lower the p-value, the higher the importance of a feature.
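Here is a minimal sketch of that per-feature Welch's t-test, assuming X is a NumPy array of shape (n_samples, n_features) and y holds the binary labels (both names are placeholders):
import numpy as np
from scipy.stats import ttest_ind

p_values = []
for i in range(X.shape[1]):
    class0 = X[y == 0, i]  # feature i values for class 0
    class1 = X[y == 1, i]  # feature i values for class 1
    _, p = ttest_ind(class0, class1, equal_var=False)  # Welch's t-test, two-tailed by default
    p_values.append(p)

ranking = np.argsort(p_values)  # lowest p-value first = most important feature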
Cheers!
So I had to create a linear regression in Python, but this dataset has over 800 columns. Is there any way to see which columns are contributing most to the linear regression model? Thank you.
Look at the coefficients for each of the features (this assumes the features are on comparable scales). Ignore the sign of the coefficient:
A large absolute value means the feature is heavily contributing.
A value close to zero means the feature is not contributing much.
A value of zero means the feature is not contributing at all.
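For example, with scikit-learn's LinearRegression you could rank the columns by the absolute value of their coefficients; this is only a sketch, assuming X is a DataFrame of features and y the target:
import pandas as pd
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X, y)
# Rank features by the magnitude of their coefficients
importance = pd.Series(model.coef_, index=X.columns).abs().sort_values(ascending=False)
print(importance.head(20))  # the 20 largest absolute coefficients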
You can measure the correlation between each independent variable and dependent variable, for example:
corr(X1, Y)
corr(X2, Y)
...
corr(Xn, Y)
and you can test the model selecting the N most correlated variables.
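With pandas this could look like the following sketch (it assumes the features and the target column 'Y' live in one DataFrame df, and N is whatever cutoff you choose):
import pandas as pd

# |corr(Xi, Y)| for every feature
correlations = df.drop(columns='Y').corrwith(df['Y']).abs()
top_n = correlations.sort_values(ascending=False).head(N).index
X_selected = df[top_n]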
There are more sophisticated methods to perform dimensionality reduction:
PCA (Principal Component Analysis)
(https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c)
Forward Feature Construction
Use XGBoost in order to measure feature importance for each variable and then select the N most important variables (see the sketch below)
(How to get feature importance in xgboost?)
There are many ways to perform this action and each one has pros and cons.
https://machinelearningmastery.com/dimensionality-reduction-algorithms-with-python/
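A rough sketch of the XGBoost option above, assuming the xgboost package is installed and X is a DataFrame of features with y as the target:
import pandas as pd
from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(X, y)
# Rank features by the importance scores learned by the boosted trees
importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance.head(N))  # keep the N most important variables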
If you are just looking for variables with high correlation, I would just do something like this:
import pandas as pd

cols = df.columns
for c in cols:
    # Set this threshold to whatever you would like
    if c != 'Y' and abs(df['Y'].corr(df[c])) > .7:
        print(c, df['Y'].corr(df[c]))
After you have decided what threshold/columns you want, you can append c to a list instead of printing.
I'm using a logistic regression to estimate the probability of scoring a goal in soccer/football. I've got 5 features. My target values are 1 (goal) or 0 (no goal).
As is always a must, I've scaled my features before fitting my model. I've used the MinMaxScaler, which scales all features to the range [0, 1] as follows:
X_scaled = (x - x_min)/(x_max - x_min)
The coefficients of my logistic regression model are the following:
coef = [[-2.26286643 4.05722387 0.74869811 0.20538172 -0.49969841]]
My first thought is that the second feature is the most important, followed by the first. Is this always true?
I read that "In other words, for a one-unit increase in 'the second feature', the expected change in log odds is 4.05722387." on this site, but there the features were normalized with a mean of 50 and some standard deviation.
If I do not scale my features, the coefficients of the model are the following:
coef = [[-0.04743728 0.04394143 -0.00247654 0.23769469 -0.55051824]]
And now it seems that the first feature is more important than the second one. I read in the literature on my topic that this is indeed true, which of course confuses me.
My questions are:
Which of my features is the most important and what/why is the best methodology to find it?
How can I interpret the meaning of the scaled coefficients? E.g. what does an increase of 1 metre in feature 1 mean? Can I put 1 metre through the MinMaxScaler, see what comes out and use that as 'the one-unit increase'?
Is it true that the final probability will be computed as y = 1/(1 + exp(-fx)) with fx = intercept + feature1*coef1 + feature2*coef2 + ... (with all features scaled)?
Which of my features is the most important and what/why is the best methodology to find it?
Look at several versions of marginal effects calculations. For example, see the overview/discussion in the blog post "Stata's example resources for R".
How can I interpret the meaning of the scaled coefficients? E.g. what does an increase of 1 metre in feature 1 mean? Can I put 1 metre through the MinMaxScaler, see what comes out and use that as 'the one-unit increase'?
The interpretation depends on which marginal effects you calculate. You just need to account for scaling when you talk about one unit of X increasing/decreasing the change in probability or odds ratio etc.
Is it true that the final probability will be computed as y = 1/(1 + exp(-fx)) with fx = intercept + feature1*coef1 + feature2*coef2 + ... (with all features scaled)?
Yes, it's just that features x are in scaled measures.
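For the second question specifically, MinMax scaling makes the conversion back to original units straightforward: since X_scaled = (x - x_min)/(x_max - x_min), a one-unit (e.g. one-metre) increase in the original feature changes the log-odds by coef_scaled/(x_max - x_min). A small sketch, assuming scaler is the fitted MinMaxScaler and model the fitted LogisticRegression:
# Per-original-unit change in log-odds for each feature:
# the coefficient on scaled data divided by that feature's range (x_max - x_min)
feature_range = scaler.data_max_ - scaler.data_min_
coef_per_original_unit = model.coef_[0] / feature_range
print(coef_per_original_unit)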
I have a regression model where my target variable (days) is quantitative and ranges from 2 to 30. My RMSE is 2.5, and all the other X variables (nominal) are categorical, hence I have dummy-encoded them.
I want to know what would be a good value of RMSE. I would like to get it within 1-1.5 or even lower, but I don't know what I should do to achieve that.
Note: I have already tried feature selection and removing features with less importance.
Any ideas would be appreciated.
If your x values are categorical then it does not necessarily make much sense to bind them to a uniform grid. Who's to say categories A and B should be spaced apart the same as B and C? Assuming that they are will only lead to an incorrect representation of your results.
As your choice of scale is the unknown, you would be better, in terms of visualisation, to set your uniform x grid as the day number and then see where the categories would sit on the y scale if given a linear relationship.
RMS error doesn't come into it at all if you don't have quantitative data for x and y.
Does anyone know how to set the alpha parameter when doing Naive Bayes classification?
E.g. I first used bag of words to build the feature matrix, where each cell of the matrix is a word count, and then I used tf (term frequency) to normalize the matrix.
But when I used Naive Bayes to build the classifier model, I chose multinomial NB (which I think is correct, not Bernoulli or Gaussian). The default alpha setting is 1.0 (the documentation says it is Laplace smoothing, which I have no idea about).
The result is really bad: only 21% recall on the positive (target) class. But when I set alpha = 0.0001 (randomly picked), recall goes up to 95%.
Besides, I checked the multinomial NB formula, and I think the problem is alpha: if I use word counts as features, alpha = 1 barely affects the results, but since tf values are between 0 and 1, alpha = 1 really affects the result of the formula.
I also tested the results without tf, using only bag-of-words counts, and recall is 95% as well. So, does anyone know how to set the alpha value? Because I have to use tf as the feature matrix.
Thanks.
In Multinomial Naive Bayes, the alpha parameter is what is known as a hyperparameter, i.e. a parameter that controls the form of the model itself. In most cases, the best way to determine optimal values for hyperparameters is through a grid search over possible parameter values, using cross-validation to evaluate the performance of the model on your data at each value. See scikit-learn's GridSearchCV documentation for details on how to do this.
Why is alpha used?
For classifying a query point in NB we compute P(Y=1|W) or P(Y=0|W) (considering binary classification),
where W is the vector of words, W = [w1, w2, w3, ..., wd],
and d is the number of features.
So, at training time we need to estimate the product of all these probabilities:
P(w1|Y=1) * P(w2|Y=1) * ... * P(wd|Y=1) * P(Y=1)
The same should be done for Y=0.
For the Naive Bayes formula, refer to this (https://en.wikipedia.org/wiki/Naive_Bayes_classifier).
Now at testing time, suppose you encounter a word that is not present in the training set; then its probability of existence in a class is zero, which makes the whole product zero, which is not good.
Consider a word W* not present in the training set:
P(W*|Y=1) = P(W*, Y=1) / P(Y=1)
= (number of training points where word W* is present and Y=1) / (number of training points where Y=1)
= 0 / (number of training points where Y=1)
So to get rid of this problem we do Laplace smoothing.
We add alpha to the numerator and to the denominator:
= (0 + alpha) / (number of training points where Y=1 + K * alpha), where K is the number of distinct values the feature can take (the vocabulary size for multinomial NB).
This also matters in the real world: some words occur very few times and others much more often. Put differently, in the formula above (P(W|Y=1) = P(W, Y=1)/P(Y=1)), if the numerator and denominator are small, the estimate is easily influenced by outliers or noise. Alpha helps here too, as it moves the likelihood probabilities towards a uniform distribution as alpha increases.
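A rough numeric illustration of why alpha = 1 swamps tf features, assuming scikit-learn-style smoothing where the likelihood of feature i in class c is (N_ci + alpha) / (N_c + alpha * n_features); the counts below are made up:
n_features = 1000              # vocabulary size
N_c = 50.0                     # total tf mass in this class
N_strong, N_weak = 0.3, 0.01   # tf sums for an informative vs an uninformative word

for alpha in (1.0, 0.0001):
    p_strong = (N_strong + alpha) / (N_c + alpha * n_features)
    p_weak = (N_weak + alpha) / (N_c + alpha * n_features)
    print(alpha, round(p_strong / p_weak, 1))
# alpha = 1.0    -> ratio ~1.3 (likelihoods pushed towards uniform)
# alpha = 0.0001 -> ratio ~29.7 (the informative word still stands out)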
So alpha is a hyperparameter and you have to tune it using techniques like grid search (as mentioned by jakevdp) or random search. (https://towardsdatascience.com/hyperparameter-tuning-c5619e7e6624)
It is better to use GridSearchCV or RandomizedSearchCV (use the latter if you are on a low-spec machine) to automate the tuning of your hyperparameter, which is alpha in the case of MultinomialNB.
Do it like this:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
model = MultinomialNB()
param = {'alpha': [0.00001, 0.0001, 0.001, 0.1, 1, 10, 100, 1000]}
clf = GridSearchCV(model, param, scoring='roc_auc', cv=10, return_train_score=True)
clf.fit(X, y)
I'm using scikit-learn in Python to develop a classification algorithm to predict the gender of certain customers. Amongst others, I want to use the Naive Bayes classifier, but my problem is that I have a mix of categorical data (e.g. "Registered online", "Accepts email notifications" etc.) and continuous data (e.g. "Age", "Length of membership" etc.). I haven't used scikit much before, but I suppose that Gaussian Naive Bayes is suitable for continuous data and that Bernoulli Naive Bayes can be used for categorical data. However, since I want to have both categorical and continuous data in my model, I don't really know how to handle this. Any ideas would be much appreciated!
You have at least two options:
Transform all your data into a categorical representation by computing percentiles for each continuous variable and then binning the continuous variables using the percentiles as bin boundaries. For instance, for the height of a person, create the following bins: "very small", "small", "regular", "big", "very big", ensuring that each bin contains approximately 20% of the population of your training set. We don't have any utility to perform this automatically in scikit-learn but it should not be too complicated to do it yourself. Then fit a single multinomial NB on this categorical representation of your data.
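A sketch of that binning idea using pandas (qcut splits on percentiles so each bin holds roughly the same share of the training set; df_cont is a placeholder for the continuous columns):
import pandas as pd

labels = ["very small", "small", "regular", "big", "very big"]
# Equal-frequency bins: each holds roughly 20% of the training samples
binned = df_cont.apply(lambda col: pd.qcut(col, q=5, labels=labels))
# One-hot encode the bins before fitting a multinomial NB, e.g. pd.get_dummies(binned)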
Independently fit a Gaussian NB model on the continuous part of the data and a multinomial NB model on the categorical part. Then transform the whole dataset by taking the class-assignment probabilities (with the predict_proba method) as new features: np.hstack((multinomial_probas, gaussian_probas)), and then refit a new model (e.g. a new Gaussian NB) on the new features.
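A sketch of that second option; X_cat and X_cont are placeholders for the categorical and continuous parts of the data:
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

mnb = MultinomialNB().fit(X_cat, y)
gnb = GaussianNB().fit(X_cont, y)

# The class-assignment probabilities of both models become the new feature matrix
new_features = np.hstack((mnb.predict_proba(X_cat), gnb.predict_proba(X_cont)))
final_model = GaussianNB().fit(new_features, y)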
Hope I'm not too late. I recently wrote a library called Mixed Naive Bayes, written in NumPy. It can assume a mix of Gaussian and categorical (multinoulli) distributions on the training data features.
https://github.com/remykarem/mixed-naive-bayes
The library is written such that the APIs are similar to scikit-learn's.
In the example below, let's assume that the first 2 features are from a categorical distribution and the last 2 are Gaussian. In the fit() method, just specify categorical_features=[0,1], indicating that columns 0 and 1 follow a categorical distribution.
from mixed_naive_bayes import MixedNB

X = [[0, 0, 180.9, 75.0],
     [1, 1, 165.2, 61.5],
     [2, 1, 166.3, 60.3],
     [1, 1, 173.0, 68.2],
     [0, 2, 178.4, 71.0]]
y = [0, 0, 1, 1, 0]
clf = MixedNB(categorical_features=[0,1])
clf.fit(X, y)
clf.predict(X)
Pip installable via pip install mixed-naive-bayes. More information on the usage in the README.md file. Pull requests are greatly appreciated :)
The simple answer: multiply the results! It's the same.
Naive Bayes is based on applying Bayes' theorem with the "naive" assumption of independence between every pair of features, meaning you calculate the Bayes probability for one feature without conditioning on the others; the algorithm therefore multiplies the probability from one feature by the probability from the next (and we totally ignore the denominator, since it is just a normalizer).
So the right answer is:
calculate the probability from the categorical variables.
calculate the probability from the continuous variables.
multiply 1. and 2.
Yaron's approach needs an extra step (step 4 below):
Calculate the probability from the categorical variables.
Calculate the probability from the continuous variables.
Multiply 1. and 2.
AND
Divide 3. by the sum of the product of 1. and 2. EDIT: What I actually mean is that the denominator should be (probability of the evidence given the hypothesis is yes) + (probability of the evidence given the hypothesis is no) (assuming a binary problem, without loss of generality). Thus, the probabilities of the hypotheses (yes or no) given the evidence would sum to 1.
Step 4 is the normalization step. Take a look at remykarem's mixed-naive-bayes as an example (lines 268-278):
if self.gaussian_features.size != 0 and self.categorical_features.size != 0:
    finals = t * p * self.priors
elif self.gaussian_features.size != 0:
    finals = t * self.priors
elif self.categorical_features.size != 0:
    finals = p * self.priors
normalised = finals.T / (np.sum(finals, axis=1) + 1e-6)
normalised = np.moveaxis(normalised, [0, 1], [1, 0])
return normalised
The probabilities of the Gaussian and categorical models (t and p respectively) are multiplied together in line 269 (line 2 in the extract above) and then normalized as in step 4 in line 275 (the normalised = finals.T / ... line in the extract above).
For hybrid features, you can check this implementation.
The author has presented a mathematical justification in his Quora answer, which you might want to check.
You will need the following steps:
Calculate the probability from the categorical variables (using predict_proba method from BernoulliNB)
Calculate the probability from the continuous variables (using predict_proba method from GaussianNB)
Multiply 1. and 2. AND
Divide by the prior (either from BernoulliNB or from GaussianNB since they are the same) AND THEN
Divide 4. by the sum (over the classes) of 4. This is the normalisation step.
It should be easy enough to see how you can add your own prior instead of using those learned from the data.
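A sketch of those steps; it assumes X_cat holds the binary/categorical columns, X_cont the continuous ones, and that both models are fit on the same y:
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

bnb = BernoulliNB().fit(X_cat, y)
gnb = GaussianNB().fit(X_cont, y)

priors = np.exp(bnb.class_log_prior_)          # same prior as GaussianNB's class_prior_
combined = bnb.predict_proba(X_cat) * gnb.predict_proba(X_cont) / priors   # steps 1-4
posterior = combined / combined.sum(axis=1, keepdims=True)                 # step 5: normalise over the classes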