Figuring out which features were selected from the original dataframe is a common problem data scientists face when doing feature selection with scikit-learn's feature_selection module.
# importing modules
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
# creating the feature matrix X and the target vector Y
X = main_df.iloc[:, 0:-1]
Y = main_df.iloc[:, -1]
# feature extraction
test = SelectKBest(score_func=f_regression, k=5)
features = test.fit_transform(X,Y)
# finding selected column names
feature_idx = test.get_support(indices=True)
feature_names = main_df.columns[feature_idx]
# creating selected features dataframe with corresponding column names
features = pd.DataFrame(features, columns=feature_names)
features.head()
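If the scores themselves are also of interest, the fitted SelectKBest object exposes them through its scores_ attribute. A short follow-up sketch, reusing the test, feature_idx, and feature_names variables from above:
# pair each selected column name with its f_regression score
selected_scores = pd.Series(test.scores_[feature_idx], index=feature_names)
print(selected_scores.sort_values(ascending=False))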
I hope this code helps the community; if you find it useful, an upvote is a nice way of showing appreciation. Any and all feedback is appreciated.
Related
I am trying to build a collaborative recommendation system with the code below; I have attached an image of the error. I am new to deep learning, and I am stuck with this error when I try to train the model on a CSV data set. Can anyone please help me understand what's happening? I would really appreciate it.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Import the surprise packages
from surprise import Dataset
from surprise import Reader
from surprise.prediction_algorithms.matrix_factorization import SVD as FunkSVD
# Import GridSearchCV for algorithm tuning
from surprise.model_selection import GridSearchCV
# Import train_test_split
from surprise.model_selection import train_test_split
# Read in the prepared dataframe from the user_cleanup notebook
user_df = pd.read_csv('user_clean.csv')
user_df.head()
# Merge the two dataframes on appid
df = user_df.merge(games_df,on='appid')
df = df.drop('name', axis=1)
df.head()
# Let's take a look at one of the most prominent users in the dataset, user 24469287
df[df['user_id'] == 24469287]
# Let's find this users favorite games using the 1-5 rating scale
print(f"Shape:{df[(df['user_id'] == 24469287) & (df['rating_5'] == 5)].shape}")
display(df[(df['user_id'] == 24469287) & (df['rating_5'] == 5)])
# Prepare the dataframes for the surprise package
# Dataframe needs to contain 3 columns: user id, item id, and rating
# For the 1-10 scale
rating_10_df = df.filter(['user_id','appid','rating_10'])
rating_10_df = rating_10_df.sort_values(by=['user_id','appid'])
# And the 1-5 scale
rating_5_df = df.filter(['user_id','appid','rating_5'])
rating_5_df = rating_5_df.sort_values(by=['user_id','appid'])
# Confirm dataframe is set up properly (user, item, rating)
rating_10_df.head()
# initialize the reader with 1-10 rating scale
my_reader = Reader(rating_scale=(0,10))
# load the dataframe with the reader
md = Dataset.load_from_df(rating_10_df, my_reader)
%%time
# Set the parameter grid for optimization
param_grid = {
    # Number of latent factors. More factors could give better results, but can also lead to overfitting
    'n_factors': [50, 100, 150],
    # Number of epochs, i.e. the number of iterations the algorithm will run
    'n_epochs': [10, 20, 50],
    # Learning rate: the speed at which the algorithm learns. Larger values learn faster, smaller values learn more accurately.
    'lr_all': [0.005, 0.1],
    'biased': [False]
}
# Set GridSearchCV with 5 fold cross-validation using the FunkSVD
GS = GridSearchCV(FunkSVD, param_grid, measures=['rmse','mae','fcp'], cv=5)
# Fit the model to the data
GS.fit(md)
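For reference, once the fit completes, surprise's GridSearchCV exposes the tuning results through its best_score and best_params attributes (each a dict keyed by measure). A minimal sketch, assuming GS has been fitted as above:
# best RMSE found during the grid search and the parameters that produced it
print(GS.best_score['rmse'])
print(GS.best_params['rmse'])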
I have not clustered data in a while, and at the moment I have a massive list of accounts with their respective areas (or OUs in the table below).
I have used k-means and k-modes to try to cluster based on OU, meaning that I want the output to group the 17 OUs I have and cluster them based on the provided information. So far, the output has clustered each record individually rather than each OU. Can someone help me figure out how to group the output and then cluster? Below is a sample of the code used.
# Building the model with 3 clusters
from kmodes.kmodes import KModes
kmode = KModes(n_clusters=3, init="random", n_init=5, verbose=1)
clusters = kmode.fit_predict(df)
clusters
#insert the predicted cluster values in our original dataset.
df.insert(0, "Cluster", clusters, True)
df.head(10)
I don't have access to your data set, but below is a generic example of how to do clustering.
# Cluster analysis, or clustering, is an unsupervised machine learning task.
# It involves automatically discovering natural grouping in data. Unlike supervised learning (like predictive modeling),
# clustering algorithms only interpret the input data and find natural groups or clusters in feature space.
import statsmodels.api as sm
import numpy as np
import pandas as pd
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df_cars = pd.DataFrame(mtcars)
df_cars.head()
from sklearn.cluster import KMeans
from matplotlib import pyplot
# define dataset
X = df_cars[['mpg', 'hp']].copy()  # copy so the cluster column can be added without a SettingWithCopyWarning
# define the model
model = KMeans(n_clusters=8)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
X['kmeans'] = yhat
pyplot.scatter(X['mpg'], X['hp'], c=X['kmeans'], cmap='rainbow', s=50, alpha=0.8)
See the link below for more details.
https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20Algorithms%20Compared.ipynb
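To address the original goal of clustering the 17 OUs rather than individual records, one option is to aggregate per OU first and then cluster the aggregated profiles. A minimal sketch, assuming a dataframe df with an 'OU' column and numeric (or already encoded) feature columns:
# build one profile row per OU, then cluster those rows instead of the raw records
ou_profiles = df.groupby('OU').mean(numeric_only=True)
ou_model = KMeans(n_clusters=3, random_state=0, n_init=10)
ou_profiles['cluster'] = ou_model.fit_predict(ou_profiles)
print(ou_profiles['cluster'])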
I want to identify the 10 best features of a Dataframe using the Information Gain measure (Mutual Info in scikit-learn) and display them in a table (in ascending order according to the score obtained by the Information Gain).
In this example, features is the Dataframe that contains all the interesting training data that could tell if a restaurant will close or not.
# Initialization of data and labels
x = features.copy()  # "x" contains all training data
y = x["closed"]      # "y" contains the labels of the records in "x"
# Remove the class column (closed) from the features
x = x.drop('closed', axis=1)
# this is x.columns; sorry for the mix of French and English
features_columns = ['moyenne_etoiles', 'ville', 'zone', 'nb_restaurants_zone',
'zone_categories_intersection', 'ville_categories_intersection',
'nb_restaurant_meme_annee', 'ecart_type_etoiles', 'tendance_etoiles',
'nb_avis', 'nb_avis_favorables', 'nb_avis_defavorables',
'ratio_avis_favorables', 'ratio_avis_defavorables',
'nb_avis_favorables_mention', 'nb_avis_defavorables_mention',
'nb_avis_favorables_elites', 'nb_avis_defavorables_elites',
'nb_conseils', 'nb_conseils_compliment', 'nb_conseils_elites',
'nb_checkin', 'moyenne_checkin', 'annual_std', 'chaine',
'nb_heures_ouverture_semaine', 'ouvert_samedi', 'ouvert_dimanche',
'ouvert_lundi', 'ouvert_vendredi', 'emporter', 'livraison',
'bon_pour_groupes', 'bon_pour_enfants', 'reservation', 'prix',
'terrasse']
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
# normalization
std_scale = preprocessing.StandardScaler().fit(features[features_columns])
normalized_data = std_scale.transform(features[features_columns])
labels = np.array(features['closed'])
# split the data
train_features, test_features, train_labels, test_labels = train_test_split(normalized_data, labels, test_size=0.2, random_state=42)
labels_true = ?
labels_pred = ?
# I don't really know how to use this function to achieve what I want
from sklearn.feature_selection import mutual_info_classif
from sklearn.datasets import make_classification
# Get the mutual information coefficients and convert them to a data frame
coeff_df = pd.DataFrame(features,
                        columns=['Coefficient'], index=x.columns)
coeff_df.head()
What is the correct syntax using Mutual Info score to achieve this?
The adjusted_mutual_info_score compares ground-truth labels with label predictions from a classifier; both label arrays must have the same shape, (n_samples,).
What you need here is scikit-learn's mutual_info_classif. Pass the array of features and the corresponding labels to mutual_info_classif to get back the estimated mutual information between each feature and the target.
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.datasets import make_classification
# Generate a sample data frame
X, y = make_classification(n_samples=1000, n_features=4,
n_informative=2, n_redundant=2,
random_state=0, shuffle=False)
feature_columns = ['A', 'B', 'C', 'D']
features = pd.DataFrame(X, columns=feature_columns)
# Get the mutual information coefficients and convert them to a data frame
coeff_df = pd.DataFrame(mutual_info_classif(X, y).reshape(-1, 1),
                        columns=['Coefficient'], index=feature_columns)
Output
features.head(3)
Out[43]:
A B C D
0 -1.668532 -1.299013 0.799353 -1.559985
1 -2.972883 -1.088783 1.953804 -1.891656
2 -0.596141 -1.370070 -0.105818 -1.213570
# Displaying only the top two features. Adjust the number as required.
coeff_df.sort_values(by='Coefficient', ascending=False)[:2]
Out[44]:
Coefficient
B 0.523911
D 0.366884
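Applied to the question's own data, the same approach gives the 10 best features. A hedged sketch, assuming the x (features without the 'closed' column) and y variables from the question, and that x contains only numeric or already encoded columns:
# estimate mutual information between each feature and the target,
# keep the 10 highest scores, and display them in ascending order as requested
mi = mutual_info_classif(x, y, random_state=0)
mi_df = pd.DataFrame({'Coefficient': mi}, index=x.columns)
print(mi_df.nlargest(10, 'Coefficient').sort_values(by='Coefficient'))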
I am using a LassoCV() model for feature selection. It gives me the warning below and does not select any features:
"C:\Users\xyz\Anaconda3\lib\site-packages\sklearn\feature_selection\base.py:80: UserWarning: No features were selected: either the data is too noisy or the selection test too strict. UserWarning)"
The code is given below.
The data is at https://www.kaggle.com/jtrofe/beer-recipes/downloads/recipeData.csv/3
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
# dataset URL = https://www.kaggle.com/jtrofe/beer-recipes/downloads/recipeData.csv/3
dataframe = pd.read_csv('Brewer Friend Beer Recipes.csv', encoding = 'latin')
# Encoding the non-numerical columns
def encoding_data(dataframe):
    if dataframe.dtype == 'object':
        return LabelEncoder().fit_transform(dataframe.astype(str))
    else:
        return dataframe

# Feature selection using the selected target feature(s)
def feature_selection(raw_dataframe, target_feature_list):
    output_list = []
    # preprocessing: converting categorical data into numeric data
    dataframe = raw_dataframe.apply(encoding_data)
    column_list = dataframe.columns.tolist()
    dataframe = dataframe.dropna()
    for target in target_feature_list:
        target_feature = target
        x = dataframe.drop(columns=[target_feature])
        y = dataframe[target_feature].values
        # Lasso feature selection
        estimator = LassoCV(cv=3, n_alphas=1)
        featureselection = SelectFromModel(estimator)
        featureselection.fit(x, y)
        features = featureselection.transform(x)
        feature_list = x.columns[featureselection.get_support()]
        features = ', '.join(feature_list)
        output_list.append((target, features))
    output_df = pd.DataFrame(output_list, columns=['Name', 'Selected Features'])
    print('\nThe feature selection is done with the respective target feature(s)')
    return output_df

print(feature_selection(dataframe, ['BrewMethod']))
I am getting this warning and no features are selected. Any idea how to rectify this?
If no features have been selected, you can gradually decrease lambda (or, in scikit-learn's case, alpha). This reduces the penalization and will probably return some nonzero coefficients.
It is extremely unusual for no coefficients to be selected at all. You should also check the correlations in your data; maybe you have a lot of collinearity.
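As a concrete illustration of the first point, letting LassoCV search a fuller alpha grid (instead of n_alphas=1 as in the question) gives cross-validation a chance to pick a weaker penalty. A minimal sketch, assuming the x and y built inside the question's feature_selection function:
# try the default grid of 100 alphas and keep features with nonzero coefficients
estimator = LassoCV(cv=3, n_alphas=100, max_iter=10000)
selector = SelectFromModel(estimator)
selector.fit(x, y)
print(x.columns[selector.get_support()].tolist())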
I have been struggling with this one for a while.
My goal is to take a text feature that I have and find the best 5-10 words in it to help me classify. Hence, I am running a TfidfVectorizer and choosing the ~90 best features for now. However, after I reduce the number of features, I am unable to see which features were actually chosen.
here is what I have:
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
train=pandas.read_csv("train.tsv", sep='\t')
labels_train = train["label"]
documents = []
for i, row in train.iterrows():
    documents.append(row['boilerplate'][1:-1].lower())
vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")
features_train_transformed = vectorizer.fit_transform(documents)
selector = SelectPercentile(f_classif, percentile=0.1)
selector.fit(features_train_transformed, labels_train)
features_train_transformed = selector.transform(features_train_transformed).toarray()
The result is that features_train_transformed contains a matrix of all the TF-IDF scores per word per document for the selected words, but I have no idea which words were chosen, and methods like get_feature_names() are unavailable for the SelectPercentile class.
This is necessary because I need to add these features to a bunch of numeric features and only then make my training and predictions.
selector.get_support() gets you a boolean array of the columns that were within the percentile range you specified.
train.columns.values gets you the complete list of column names for the original dataframe.
Filtering the latter with the former gives you the names of the columns that make up your chosen percentile range.
The code below (cut and pasted from working code) is similar enough to yours that it is hopefully helpful.
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression
selection = SelectPercentile(f_regression, percentile=2)
train_minus_target = train.drop("y", axis=1)
x_features = selection.fit_transform(train_minus_target, y_train)
columns = np.asarray(train_minus_target.columns.values)
support = np.asarray(selection.get_support())
columns_with_support = columns[support]
Reference:
about get_support
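For the TF-IDF case in the question, the column names come from the vectorizer rather than from a dataframe, so the same mask can be applied to the vocabulary instead. A hedged sketch, assuming the vectorizer and selector objects from the question have already been fitted (use get_feature_names() on older scikit-learn versions):
import numpy as np
# map the selector's boolean mask back onto the vectorizer's vocabulary
word_names = np.asarray(vectorizer.get_feature_names_out())
chosen_words = word_names[selector.get_support()]
print(chosen_words)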