How does SelectFromModel from scikit-learn select features? - python

When I use XGBClassifier with SelectFromModel, the algorithm always returns around five features, regardless of the max_features value.
My question is: does XGBClassifier think that there are only five useful features in my dataset?
from sklearn.feature_selection import SelectFromModel
from xgboost import XGBClassifier
sf=SelectFromModel(XGBClassifier(), max_features=10).fit(X, y)
#The output only contains five True, all remaining are False
print(sf.get_support())

To only select based on max_features, set threshold=-np.inf.
I found the above text in the sklearn.feature_selection documentation. It means that the threshold parameter takes priority: SelectFromModel keeps only the features that pass the threshold, and max_features merely caps how many of those are kept.
If you want max_features alone to decide, set threshold=-np.inf. In that case all features pass the threshold, and max_features then selects the requested number of features based on their rank.
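A minimal sketch of the difference, under some assumptions: the data is synthetic, and RandomForestClassifier stands in for XGBClassifier so that the example only needs scikit-learn. When no threshold is given and the estimator has no L1 penalty, SelectFromModel falls back to threshold="mean", which is why fewer than max_features features can come back.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data for illustration only: 20 features, 5 of them informative.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Default threshold: a feature must pass it to be kept,
# and max_features only caps the number of survivors.
capped = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    max_features=10,
).fit(X, y)

# threshold=-np.inf: every feature passes, so the top max_features are kept.
top10 = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    threshold=-np.inf,
    max_features=10,
).fit(X, y)

n_capped = int(capped.get_support().sum())
n_top10 = int(top10.get_support().sum())
print(n_capped, n_top10)
```

The second selector should report exactly ten selected features, while the first can report anything from one to ten, depending on how many importances pass the default threshold.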

You can use threshold='median' to get the best half of the features, then call it one more time on the resulting dataset to get the best quarter, and so on. A Pipeline makes this kind of repeated feature reduction easier to handle.
sf = SelectFromModel(RandomForestClassifier(), threshold='median')
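Chaining two such selectors in a Pipeline could look roughly like this. The data is synthetic and the step names "half" and "quarter" are made up for the illustration; this is only a sketch of the idea.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline

# Synthetic data for illustration only.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Each step refits a forest on the features that survived the previous step
# and keeps those whose importance is at or above the median.
halve_twice = Pipeline([
    ("half", SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0),
        threshold="median")),
    ("quarter", SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=1),
        threshold="median")),
])

X_reduced = halve_twice.fit_transform(X, y)
print(X.shape[1], "->", X_reduced.shape[1])
```

Ties at the median can leave slightly more than half of the features in a step, so the final count is not guaranteed to be exactly a quarter.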

Related

Using AdaBoostClassifier with null values

I'm trying to implement an AdaBoostClassifier model in Python.
I'm using a dataset where all columns are numeric, and in some cases the values are null.
Using adaboost in R, it seems that R deals with nulls automatically; however, when I try to do the same thing in Python I get the error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
If I try to manually fix this with:
X.fillna(X.mean(), inplace=True)
The problem goes away. But I do not want to average the null values.
Can AdaBoostClassifier work with nulls in Python, or do I have to treat them first?
PS: I tried to give allow_nan=True in the validation function that adaboost uses but... I really don't know how to do that in the correct form.
Thank you
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
# Raw string so the backslashes in the Windows path are not treated as escapes
data = pd.read_excel(r"C:\Users\file.xlsx")
X = data[["oprevenue","total_assets","fixed_assets","cost_of_employees","sales","ebitda","volume","number_of_employees"]]
y = data.iloc[:,-1]
AdaModel = AdaBoostClassifier(n_estimators=100,learning_rate=1)
model = AdaModel.fit(X,y) #-->Blows up here
previsao = model.predict(X)
The default base estimator for AdaBoostClassifier is a DecisionTree which does not handle missing values. You need to either impute your missing values first or use a different base estimator that can handle them.
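If imputing is acceptable but the mean is not, one option is to put an imputer with a different strategy in front of the classifier. The sketch below uses the median on synthetic data with randomly removed values; both the strategy and the data are assumptions made for the illustration, not something taken from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the real data, with about 10% of the values removed.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# The imputer fills the gaps before AdaBoost ever sees the data.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    AdaBoostClassifier(n_estimators=100, random_state=0),
)
model.fit(X, y)
predictions = model.predict(X)
print(predictions[:10])
```

If no imputation at all is wanted, a different model may be the simpler route: scikit-learn's HistGradientBoostingClassifier, for example, accepts NaN values directly.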

decision tree always returns the same value for different inputs

I am new to coding and am learning machine learning with Python. Using a decision tree, I am trying to predict an individual's chance of a heart attack using a dataset from Kaggle. After fitting the model, when I try to predict for different inputs, it always returns the same output, [1]. What may be the problem? What can I do? This is my code.
import pandas as pd
import numpy as np
heart=pd.read_csv('heart_attack.csv')
heart.fillna(heart.mean(),inplace=True)
x=heart.iloc[:,:-1]
y=heart.iloc[:,-1]
import sklearn
from sklearn.preprocessing import LabelEncoder
x=x.apply(LabelEncoder().fit_transform)
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test= train_test_split(x,y,test_size=0.20, random_state=85)
from sklearn.tree import DecisionTreeClassifier
result=DecisionTreeClassifier()
result.fit(x_train,y_train)
y_pred=result.predict(x_test)
This is the code where I store the input values:
Patient_Data = [Patient_Age,Patient_Gender,Patient_pain,Patient_RBP,Patient_chol,Patient_FBS,Patient_ECG,Patient_thalach,Patient_exang,Patient_oldpeak,Patient_slope,Patient_thal]
Patient_Data_New= pd.DataFrame([Patient_Data],columns=['Age','Gender','cp','restbps','chol','FBS','restecg','thalach','exang','oldpeak','slope', 'thal'])
Patient= result.predict(Patient_Data_New)
if Patient>0:
    print ('This patient has a chance to get heart attack')
else:
    print ('This patient does not have a chance to get heart attack')
Thanks in advance.
This is probably because you are predicting values for data which is already present in the training set, hence your model is able to predict all the values correctly thus giving 100 percent accuracy.
Try this:
result.score(x_train, y_train)
If the result is 100 percent, the model is probably overfitting, and you need to drop a few columns and find the best parameters using k-fold cross-validation.
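A rough sketch of the k-fold search suggested here, on synthetic data. The parameter grid is an arbitrary assumption chosen for illustration, not a recommendation for the heart attack dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset.
X, y = make_classification(n_samples=300, n_features=12, random_state=85)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=85
)

# 5-fold cross-validation over a small grid of tree-size limits.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 5, None], "min_samples_leaf": [1, 5, 10]},
    cv=5,
)
search.fit(x_train, y_train)

# A large gap between these two scores points to overfitting.
train_score = search.score(x_train, y_train)
test_score = search.score(x_test, y_test)
print(search.best_params_, train_score, test_score)
```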

How to use k means for a product recommendation dataset

I have a data set with columns titled product name, brand, rating (1:5), review text, and review-helpfulness. I need to propose a recommendation algorithm using the reviews. I have to use Python for the coding. The data set is in .csv format.
To identify the nature of the data set I need to use k-means on it. How do I use k-means on this data set?
So far I have done the following:
1. Data pre-processing
2. Review text data cleaning
3. Sentiment analysis
4. Giving each review a sentiment score from 1 to 5 according to the sentiment value (given by the sentiment analysis), and tagging reviews as very negative, negative, neutral, positive, or very positive
After these steps I have these columns in my data set: product name, brand, rating (1:5), review text, review-helpfulness, sentiment-value, sentiment-tag.
This is the link to the data set https://drive.google.com/file/d/1YhCJNvV2BQk0T7PbPoR746DCL6tYmH7l/view?usp=sharing
I tried to run k-means using the following code. It runs without error, but I don't know whether the result is useful, or whether there are other ways to use k-means on this data set to get other useful outputs. How should I use k-means to learn more about this data set?
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
df.info()
X = np.array(df.drop(['sentiment_value'], axis=1).astype(float))
y = np.array(df['rating'])
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
# Output of the fit call:
# KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
#        n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
#        random_state=None, tol=0.0001, verbose=0)
plt.show()
You did not plot anything.
So nothing shows up.
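For plt.show() to display anything, something has to be drawn first. A minimal sketch, assuming two numeric features and made-up data (the axis labels are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Made-up 2-D data: two loose groups of points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(1.5, 0.3, size=(50, 2)),
    rng.normal(4.5, 0.3, size=(50, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Draw the points coloured by cluster, then mark the cluster centres.
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=15)
ax.scatter(
    kmeans.cluster_centers_[:, 0],
    kmeans.cluster_centers_[:, 1],
    marker="x", color="red", s=100,
)
ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2")
```

Calling plt.show() after these lines then has a figure to display.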
Unless you are more specific about what you are trying to achieve, we won't be able to help. Figure out what exactly you want to predict. Do you just want to cluster products according to their sentiment score (which isn't especially promising), or do you want to predict actual product preferences on a new dataset?
If you want to build a recommendation system, the only possibility (considering your dataset) would be to identify similar products according to the rating/sentiment. Is that what you want?
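One possible reading of "similar products according to the rating/sentiment" is to average the scores per product and cluster the products rather than the individual reviews. The sketch below uses made-up reviews; the column name product_name and the helper similar_products are assumptions for the illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up reviews; 'product_name' is an assumed column name.
df = pd.DataFrame({
    "product_name": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "rating": [5, 4, 5, 5, 1, 2, 2, 1],
    "sentiment_value": [5, 4, 4, 5, 1, 1, 2, 2],
})

# One row per product: mean rating and mean sentiment score.
per_product = df.groupby("product_name")[["rating", "sentiment_value"]].mean()

# Scale the two columns, then cluster the products.
scaled = StandardScaler().fit_transform(per_product)
per_product["cluster"] = KMeans(
    n_clusters=2, n_init=10, random_state=0
).fit_predict(scaled)


def similar_products(name):
    """Return the other products that share a cluster with `name`."""
    cluster = per_product.loc[name, "cluster"]
    same = per_product.index[per_product["cluster"] == cluster]
    return [p for p in same if p != name]


print(similar_products("A"))
```

On real data the number of clusters would need tuning, and two averaged scores per product is a very thin basis for a recommendation.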

How to restore the original feature names in XGBoost feature importance plot (after preprocessing removed them)?

Preprocessing the training data (such as centering or scaling) before training an XGBoost model, can lead to a loss of feature names. Most answers on SO suggest training the model in such a way that feature names aren't lost (such as using pd.get_dummies on data frame columns).
I have trained an XGBoost model using the preprocessed data (centre and scale using MinMaxScaler). Thereby, I am in a similar situation where feature names are lost.
For instance:
scaler = MinMaxScaler(feature_range=(0, 1))
X = scaler.fit_transform(X)
my_model_name = XGBClassifier()
my_model_name.fit(X,Y)
where X and Y are the training data and labels respectively. The scaling above returns a 2D NumPy array, thereby discarding feature names from pandas DataFrame.
Thus, when I try to use plot_importance(my_model_name), it leads to the plot of feature importance, but only with feature names such as f0, f1, f2 etc., and not the actual feature names from the original data set.
Is there a way to map the feature names from the original training data to the feature importance plot generated, so that the original feature names are plotted in the graph? Any help in this regard is highly appreciated.
You can get the features names by:
model.get_booster().feature_names
You are right that when you pass a NumPy array to the fit method of XGBoost, you lose the feature names. In such a case, calling model.get_booster().feature_names is not useful, because the returned names are in the form [f0, f1, ..., fn], and these names are shown in the output of the plot_importance method as well.
But there should be several ways to achieve what you want, supposing you stored your original feature names somewhere, e.g. orig_feature_names = ['f1_name', 'f2_name', ..., 'fn_name'], or directly orig_feature_names = X.columns if X was a pandas DataFrame.
Then you should be able to:
change the stored feature names (model.get_booster().feature_names = orig_feature_names) and then use the plot_importance method, which should pick up the updated names and show them on the plot
or, since this method returns a matplotlib ax, you can modify the labels using plot_importance(model).set_yticklabels(orig_feature_names) (but you have to get the order of your features right)
or you can take model.feature_importances_ and combine it with your original feature names yourself (i.e. plot it yourself)
similarly, you can also use the model.get_booster().get_score() method and combine it with your feature names
or you can try the Learning API with an xgboost DMatrix and specify your feature names when creating the dataset (after scaling) with train_data = xgb.DMatrix(X, label=Y, feature_names=orig_feature_names) (but I do not have much experience with this way of training since I usually use the Scikit-Learn API)
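The "combine it yourself" options can be sketched as follows. To keep the example runnable with scikit-learn alone, GradientBoostingClassifier stands in for XGBClassifier (both expose feature_importances_); the feature names, the example scores, and the helper rename_scores are all made up for the illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import MinMaxScaler

# Hypothetical original column names, saved before scaling.
orig_feature_names = ["age", "income", "score", "visits"]

X_raw, Y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X = MinMaxScaler().fit_transform(X_raw)  # NumPy array, names are gone

# Stand-in model; an XGBClassifier has the same feature_importances_ attribute.
model = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X, Y)

# Pair each importance with its original name, in column order.
importances = pd.Series(
    model.feature_importances_, index=orig_feature_names
).sort_values()
print(importances)


def rename_scores(scores, names):
    """Map keys such as 'f0' or 'f3' to the names at those column positions."""
    return {names[int(key[1:])]: value for key, value in scores.items()}


# Made-up dict in the shape that get_score() returns for unnamed features.
renamed = rename_scores({"f0": 12, "f3": 7}, orig_feature_names)
print(renamed)
```

With matplotlib installed, importances.plot.barh() would draw the bar chart with the original names. As far as I know, get_score() leaves out features that were never used in a split, which is why the helper looks names up by index instead of assuming every feature is present.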
EDIT:
Thanks to @Noob Programmer (see comments below): there might be some "inconsistencies" depending on which feature importance method is used. These are the most important ones:
xgboost.plot_importance uses "weight" as the default importance type (see plot_importance)
model.get_booster().get_score() also uses "weight" as the default (see get_score)
model.feature_importances_ depends on the importance_type parameter (model.importance_type), and it seems that the result is normalized to sum to 1 (see this comment)
For more info on this topic, look at How to get feature importance.
I tried the above answers, but they didn't work when loading the model after training.
So, the code that worked for me is:
model.feature_names
It returns a list of the feature names.
I think it is best to turn the NumPy array back into a pandas DataFrame, e.g.:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier, plot_importance
Y = label  # label: the target values, defined elsewhere
X_df = pd.read_csv("train.csv")
orig_feature_names = list(X_df.columns)
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled_np = scaler.fit_transform(X_df)
# Wrap the scaled array in a DataFrame so the column names survive
X_scaled_df = pd.DataFrame(X_scaled_np, columns=orig_feature_names)
my_model_name = XGBClassifier(max_depth=2, n_estimators=2)
my_model_name.fit(X_scaled_df,Y)
plot_importance(my_model_name)
plt.show()
This will show the original names.

Up-/downsampling with One vs. rest classifier

I have a data set (tf-idf weighted words) with multiple classes that I am trying to predict. My classes are imbalanced. I would like to use the One vs. rest classification approach with some classifiers (e.g. Multinomial Naive Bayes) using the OneVsRestClassifier from sklearn.
Additionally, I would like to use the imbalanced-learn package (most likely one of the combinations of up- and downsampling) to enhance my data. The normal approach of using imbalanced-learn is:
from imblearn.combine import SMOTEENN
smote_enn = SMOTEENN(random_state=0)
X_resampled, y_resampled = smote_enn.fit_resample(X, y)
I now have a data set with roughly the same number of cases for every label. I then would use the classifier on the resampled data.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
ovr = OneVsRestClassifier(MultinomialNB())
ovr.fit(X_resampled, y_resampled)
But: now there is a huge imbalance for every label when it's fitted, because I have in total more than 50 labels. Right? I imagine that I need to apply the up-/downsampling method for every label instead of doing it once at the beginning. How can I use the resampling for every label?
As per the discussion in comments, what you want can be done like this:
from sklearn.naive_bayes import MultinomialNB
from imblearn.combine import SMOTEENN
# Observe how I imported Pipeline from IMBLEARN and not SKLEARN
from imblearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
# This pipeline will resample the data and
# pass the output to MultinomialNB
pipe = Pipeline([('sampl', SMOTEENN()),
                 ('clf', MultinomialNB())])
# OVR will transform `y` into one binary target per label and
# then pass each single-label target to its own copy of pipe
# (one copy per label in the data)
ovr = OneVsRestClassifier(pipe)
ovr.fit(X, y)
Explanation of code:
Step 1: OneVsRestClassifier will create multiple columns of y, one for each label, where that label is positive and all others are negative.
Step 2: For each label, OneVsRestClassifier will clone the supplied pipe estimator and pass the individual data to it.
Step 3:
a. Each copy of pipe will get a different version of y, which is passed to SMOTEENN inside it and so will do a different sampling to balance the classes there.
b. The second part of pipe (clf) will get that balanced dataset for each label as you wanted.
Step 4: During prediction time, the sampling part will be turned off, so the data will reach the clf as it is. The sklearn pipeline doesn't handle that part, which is why I used imblearn.pipeline.
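To make Step 1 concrete, the per-label columns can be reproduced with LabelBinarizer, which is roughly what OneVsRestClassifier does internally. The labels below are made up for the illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Made-up multiclass target with three labels.
y = np.array(["a", "b", "c", "a"])

binarizer = LabelBinarizer()
Y_columns = binarizer.fit_transform(y)

# One column per label: 1 where the sample has that label, 0 otherwise.
for label, column in zip(binarizer.classes_, Y_columns.T):
    print(label, column)
```

Each of these columns is the binary y that one copy of pipe receives, so SMOTEENN balances "this label vs. the rest" separately for every label.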
Hope this helps.
