what is "UserWarning: No features were selected" - python

I am using a LassoCV() model for feature selection. It raises the warning below and does not select any features:
"C:\Users\xyz\Anaconda3\lib\site-packages\sklearn\feature_selection\base.py:80: UserWarning: No features were selected: either the data is too noisy or the selection test too strict.
UserWarning)"
The code is given below. The data is from https://www.kaggle.com/jtrofe/beer-recipes/downloads/recipeData.csv/3
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# dataset URL = https://www.kaggle.com/jtrofe/beer-recipes/downloads/recipeData.csv/3
dataframe = pd.read_csv('Brewer Friend Beer Recipes.csv', encoding='latin')

# Encoding the non-numerical columns
def encoding_data(dataframe):
    if dataframe.dtype == 'object':
        return LabelEncoder().fit_transform(dataframe.astype(str))
    else:
        return dataframe

# Feature selection using the selected target feature(s)
def feature_selection(raw_dataframe, target_feature_list):
    output_list = []
    # preprocessing: convert categorical data into numeric data
    dataframe = raw_dataframe.apply(encoding_data)
    column_list = dataframe.columns.tolist()
    dataframe = dataframe.dropna()
    for target_feature in target_feature_list:
        x = dataframe.drop(columns=[target_feature])
        y = dataframe[target_feature].values
        # Lasso feature selection
        estimator = LassoCV(cv=3, n_alphas=1)
        featureselection = SelectFromModel(estimator)
        featureselection.fit(x, y)
        feature_list = x.columns[featureselection.get_support()]
        features = ', '.join(feature_list)
        output_list.append((target_feature, features))
    output_df = pd.DataFrame(output_list, columns=['Name', 'Selected Features'])
    print('\nThe feature selection is done with the respective target feature(s)')
    return output_df

print(feature_selection(dataframe, ['BrewMethod']))
I am getting this warning and no features are selected. Any idea how to rectify this?

If no features have been selected you can gradually decrease lambda (or, in scikit-learn's case, alpha). This reduces the penalization and will probably return some nonzero coefficients.
It is extremely unusual that no coefficients were selected at all. You should also think about checking the correlations in your data: maybe you have a lot of collinearity.
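A minimal sketch of what that could look like with the question's setup; the alphas grid and the SelectFromModel threshold below are illustrative assumptions, not prescribed values, and x, y are the feature matrix and target from the question's function:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# Let LassoCV search a whole grid of alphas instead of a single one (n_alphas=1),
# so cross-validation can settle on a weaker penalty.
estimator = LassoCV(cv=3, n_alphas=100)

# Relax the selection threshold so small but nonzero coefficients still pass.
featureselection = SelectFromModel(estimator, threshold=1e-5)
featureselection.fit(x, y)
print(x.columns[featureselection.get_support()])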

Related

T-distributed Stochastic Neighbor Embedding (t-SNE)

I am trying to run t-distributed Stochastic Neighbor Embedding (t-SNE) in Jupyter but keep facing an issue with
ValueError: could not convert string to float: '<Null>'
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Reading the data using pandas
df = pd.read_csv("E:\\Field data\\Output\\Pixel values7.csv")

# print the first few rows of df
print(df.head(9))

# save the labels into a variable l.
l = df['label']

# Drop the label feature and store the pixel data in d.
d = df.drop("label", axis = 1)
I get the error after this line:
# Data-preprocessing: standardizing the data
standardized_data = StandardScaler().fit_transform(df)
print(standardized_data.shape)

# TSNE
# Picking the top 1000 points as TSNE
# takes a lot of time for 15K points
data_1000 = standardized_data[0:1000, :]
labels_1000 = l[0:1000]

model = TSNE(n_components = 2, random_state = 0)
# configuring the parameters
# the number of components = 2
# default perplexity = 30
# default learning rate = 200
# default maximum number of iterations
# for the optimization = 1000
tsne_data = model.fit_transform(data_1000)

# creating a new data frame which
# helps us in plotting the result data
tsne_data = np.vstack((tsne_data.T, labels_1000)).T
tsne_df = pd.DataFrame(data = tsne_data,
                       columns = ("Dim_1", "Dim_2", "label"))

# Plotting the result of tsne
sn.FacetGrid(tsne_df, hue = "label", size = 6).map(
    plt.scatter, 'Dim_1', 'Dim_2').add_legend()
plt.show()
I got this code from somewhere and I am not an expert in Python, so I would appreciate your help.
I am trying to run this program on my data but always get the error
ValueError: could not convert string to float: '<Null>'
If there is any other code for t-distributed Stochastic Neighbor Embedding (t-SNE), please let me know.
My data looks like this:
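The error message suggests the CSV contains the literal string '<Null>' where values are missing. A minimal sketch of one way to deal with that before scaling (the file path is the one from the question; treating '<Null>' as NaN and scaling only the non-label columns are assumptions on my part):

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Treat the literal string '<Null>' as a missing value while reading the CSV.
df = pd.read_csv("E:\\Field data\\Output\\Pixel values7.csv", na_values=['<Null>'])

# Drop rows with missing values and exclude the string label column before scaling,
# since StandardScaler only accepts numeric data.
d = df.drop("label", axis=1).dropna()
standardized_data = StandardScaler().fit_transform(d)
print(standardized_data.shape)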

Selected Features Column Names in Scikit Learn Feature Selection

Figuring out which features were selected from the main dataframe is a very common problem data scientists face when doing feature selection with scikit-learn's feature_selection module.
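The snippet below assumes a dataframe main_df whose last column is the target. As a hypothetical stand-in for a quick test (not part of the original post), something like this works:

import numpy as np
import pandas as pd

# Hypothetical stand-in for main_df: eight random feature columns plus a target
# that depends on the first feature, so SelectKBest has something to find.
rng = np.random.default_rng(0)
main_df = pd.DataFrame(rng.normal(size=(100, 8)),
                       columns=[f"feat_{i}" for i in range(8)])
main_df["target"] = 2 * main_df["feat_0"] + rng.normal(size=100)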
# importing modules
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

# creating the feature matrix X and the target vector Y
X = main_df.iloc[:, 0:-1]
Y = main_df.iloc[:, -1]

# feature extraction
test = SelectKBest(score_func=f_regression, k=5)
features = test.fit_transform(X, Y)

# finding the selected column names
feature_idx = test.get_support(indices=True)
feature_names = main_df.columns[feature_idx]

# creating a selected-features dataframe with the corresponding column names
features = pd.DataFrame(features, columns=feature_names)
features.head()
I hope my code helps the community, and if you like the effort, do upvote; it is a form of showing appreciation. Any and all feedback is appreciated.

Feature selection and categorical variables

I work on a dataset which contains mainly binary variables. However, two of them are categorical with multiple values (strings). I want to apply feature selection using lasso, but I get the error KeyError: could not convert string to float:
Should I use LabelEncoder and then do the feature selection? Any ideas how to deal with this?
Here is my code:
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

X = data.iloc[:, :-1]
y = data.iloc[:, -1]

scaler = MinMaxScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

selector = SelectFromModel(estimator=LassoCV(cv=5)).fit(X_scaled, y)
selector.get_support()
It is problematic to use one-hot encoding because each category will be coded as a separate binary column, and feeding those into lasso does not allow selection of the categorical variable as a whole, which I guess is what you are after. You can also check out this post.
You can use the group lasso implementation in Python (the group_lasso package). Below I use an example dataset:
import pandas as pd
import numpy as np
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from group_lasso import GroupLasso
from group_lasso.utils import extract_ohe_groups
import scipy.sparse

data = pd.DataFrame({'cat1': np.random.choice(['A', 'B', 'C'], 100),
                     'cat2': np.random.choice(['D', 'E', 'F'], 100),
                     'bin1': np.random.choice([0, 1], 100),
                     'bin2': np.random.choice([0, 1], 100)})

data['y'] = (1.5 * data['bin1'] - 3 * data['bin2']
             + 2 * (data['cat1'] == 'A').astype('int')
             + np.random.normal(0, 1, 100))
Define the categorical and numeric (binary) columns. You don't need the MinMaxScaler since your values are binary. Next we one-hot encode the categorical columns and extract the groups:
cat_columns = ['cat1','cat2']
num_columns = ['bin1','bin2']
ohe = OneHotEncoder()
onehot_data = ohe.fit_transform(data[cat_columns])
groups = extract_ohe_groups(ohe)
Put the numeric and one-hot columns together. You can also convert them to dense, but that can be problematic if the data is huge:
X = scipy.sparse.hstack([onehot_data,scipy.sparse.csr_matrix(data[num_columns])])
y = data['y']
Likewise, construct the groups:
groups = np.hstack([groups,len(cat_columns) + np.arange(len(num_columns))+1])
groups
Run the group lasso:
grpLasso = GroupLasso(groups=groups, supress_warning=True, n_iter=1000)
grpLasso.fit(X, y)

grpLasso.sparsity_mask_
array([ True,  True,  True, False, False, False,  True,  True])

grpLasso.chosen_groups_
{0, 3, 4}
Also check out the help page for using it in a pipeline.
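The Pipeline and r2_score imports above are not used in the snippet itself; a minimal sketch of that pipeline idea, assuming GroupLasso is used as a variable-selection step in front of an ordinary regressor (the Ridge step is my own illustrative choice, not from the post):

from sklearn.linear_model import Ridge

# Hypothetical pipeline: GroupLasso drops the unselected columns, Ridge fits on the rest.
pipe = Pipeline([
    ('variable_selection', GroupLasso(groups=groups, supress_warning=True, n_iter=1000)),
    ('regressor', Ridge()),
])
pipe.fit(X, y)
print(r2_score(y, pipe.predict(X)))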

How to build a full trainset when loading data from predefined folds in Surprise?

I am using Surprise to evaluate various recommender system algorithms. I would like to calculate predictions and prediction coverage on all possible user and item permutations. My data is loaded in from predefined splits.
My strategy to calculate prediction coverage is to:
1. build a full trainset and fit
2. get lists of all users and items
3. iterate through the lists and make predictions
4. count the exceptions where a prediction is impossible, to calculate prediction coverage
Trying to call data.build_full_trainset() yields the following error:
AttributeError: 'DatasetUserFolds' object has no attribute 'build_full_trainset'
Is there a way to build a full trainset when loading data from predefined folds?
Alternatively, I will attempt to combine the data outside of Surprise into a dataframe and redo the process. Or are there better approaches?
Thank you.
# %% https://surprise.readthedocs.io/en/stable/getting_started.html#basic-usage
import random
import pickle
import numpy as np
import pandas as pd
# from survey.data_cleaning import long_ratings
from surprise import NormalPredictor
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV
# from surprise.model_selection import LeaveOneOut, KFold
from surprise.model_selection import PredefinedKFold

# set random seed for reproducibility
my_seed = 0
random.seed(my_seed)
np.random.seed(my_seed)

path = 'data/recommenders/'

def load_splits():
    """
    Loads the splits created by the colab code and stored to files.
    Used in surprise_recommenders.py.
    Returns the splits as a Dataset.
    """
    # path to dataset folder
    files_dir = 'data/recommenders/splits/'

    # This time, we'll use the built-in reader.
    reader = Reader(line_format='user item rating', sep=' ', skip_lines=0, rating_scale=(1, 5))

    # folds_files is a list of tuples containing file paths:
    # [(u1.base, u1.test), (u2.base, u2.test), ... (u5.base, u5.test)]
    train_file = files_dir + 'u%d.base'
    test_file = files_dir + 'u%d.test'
    folds_files = [(train_file % i, test_file % i) for i in (1, 2, 3, 4, 5)]

    data = Dataset.load_from_folds(folds_files, reader=reader)
    return data

data = load_splits()
pkf = PredefinedKFold()

algos = {
    'NormalPredictor': {'constructor': NormalPredictor,
                        'param_grid': {}
                        }}

key = "stratified_5_fold"
cv_results = {}
print(f"Performing {key} cross validation.")

for algo_name, v in algos.items():
    print("Working on algorithm: ", algo_name)
    gs = GridSearchCV(v['constructor'], v['param_grid'], measures=['rmse', 'mae'], cv=pkf)
    gs.fit(data)

    # best RMSE score
    print(gs.best_score['rmse'])
    # combination of parameters that gave the best RMSE score
    print(gs.best_params['rmse'])

    # Predict on the full dataset
    # using the weights that yield the best rmse:
    algo = gs.best_estimator['rmse']
    algo.fit(data.build_full_trainset())  # predefined folds break it.

    cv_results[algo_name] = pd.DataFrame.from_dict(gs.cv_results)
TL;DR: The model_selection documentation in Surprise describes a refit method that fits on a full trainset, but it explicitly does not work with predefined folds.
Another major issue: oyyablokov's comment on this issue suggests you cannot fit a model on data that has NaNs. So even if you had a full trainset, how would you create a full prediction matrix to calculate things like prediction coverage, which requires all user and item combinations, with or without ratings?
My workaround was to create 3 Surprise datasets:
1. The dataset from the predefined folds, to compute best_params
2. The full dataset of ratings, combining all folds outside of Surprise (see the sketch after this list)
3. The full prediction-matrix dataset, including all possible combinations of users and items (with or without ratings)
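A minimal sketch of step 2, assuming the fold files follow the space-separated 'user item rating' u%d.base naming from the question (the column names and the drop_duplicates call are my own assumptions):

import pandas as pd
from surprise import Dataset, Reader

# Read every predefined train fold and stack them into one ratings dataframe.
files_dir = 'data/recommenders/splits/'
frames = [pd.read_csv(files_dir + 'u%d.base' % i, sep=' ',
                      names=['user', 'item', 'rating'])
          for i in (1, 2, 3, 4, 5)]
full_df = pd.concat(frames).drop_duplicates()

# Build a regular Surprise dataset from the combined ratings.
reader = Reader(rating_scale=(1, 5))
full_data = Dataset.load_from_df(full_df[['user', 'item', 'rating']], reader)
full_trainset = full_data.build_full_trainset()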
After you find your best parameters with grid-search cross validation, you can find your predictions and coverage with something like this:
import pandas as pd
from surprise import Dataset, Reader

def get_pred_coverage(data_matrix, algo_constructor, best_params, verbose=False):
    """
    Calculates prediction coverage.
    Inputs:
        data_matrix: numpy matrix with columns 0, 1, 2 as user, service, rating
        algo_constructor: the Surprise algorithm constructor to pass the best params into
        best_params: Surprise gs.best_params to pass into the algo
    Returns: coverage & full predictions
    """
    reader = Reader(rating_scale=(1, 5))
    full_predictions = []  # list to store prediction results

    df = pd.DataFrame(data_matrix)
    if verbose: print(df.info())
    df_no_nan = df.dropna(subset=[2])
    if verbose: print(df_no_nan.head())

    no_nan_dataset = Dataset.load_from_df(df_no_nan[[0, 1, 2]], reader)
    full_dataset = Dataset.load_from_df(df[[0, 1, 2]], reader)

    # Predict on the full dataset
    # using the weights that yield the best rmse:
    algo = algo_constructor(**best_params)  # pass the dictionary as keyword arguments to the constructor

    # Create a no-NaN trainset to fit on
    no_nan_trainset = no_nan_dataset.build_full_trainset()
    algo.fit(no_nan_trainset)
    if verbose: print('Number of trainset users: ', no_nan_trainset.n_users, '\n')
    if verbose: print('Number of trainset items: ', no_nan_trainset.n_items, '\n')

    pred_set = full_dataset.build_full_trainset()
    if verbose: print('Number of users: ', pred_set.n_users, '\n')
    if verbose: print('Number of items: ', pred_set.n_items, '\n')

    # get all item ids
    pred_set_iids = list(pred_set.all_items())
    # print(f'pred_set iids are {pred_set_iids}')
    iid_converter = lambda x: pred_set.to_raw_iid(x)
    pred_set_raw_iids = list(map(iid_converter, pred_set_iids))

    # get all user ids
    pred_set_uids = list(pred_set.all_users())
    uid_converter = lambda x: pred_set.to_raw_uid(x)
    pred_set_raw_uids = list(map(uid_converter, pred_set_uids))
    # print(f'pred_set uids are {pred_set_uids}')

    for user in pred_set_raw_uids:
        for item in pred_set_raw_iids:
            r_ui = float(df[2].loc[(df[0] == user) & (df[1] == item)])  # find the rating by user and item
            # print(f"r_ui is type {type(r_ui)} and value {r_ui}")
            prediction = algo.predict(uid=user, iid=item, r_ui=r_ui)
            # print(prediction)
            full_predictions.append(prediction)

    # The 5th element of each prediction tuple is the details dictionary,
    # whose "was_impossible" flag marks predictions that could not be made.
    impossible_count = 0
    for prediction in full_predictions:
        impossible_count += prediction[4]['was_impossible']
    if verbose: print(f"for algo {algo}, impossible_count is {impossible_count}")

    prediction_coverage = (pred_set.n_users * pred_set.n_items - impossible_count) / (pred_set.n_users * pred_set.n_items)
    print(f"prediction_coverage is {prediction_coverage}")
    return prediction_coverage, full_predictions
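A hypothetical call, reusing gs from the grid search above and a tiny made-up ratings matrix (columns user, item, rating, with NaN marking unrated pairs):

import numpy as np
from surprise import NormalPredictor

# Hypothetical ratings matrix: columns are user id, item id, rating (NaN = unrated).
data_matrix = np.array([[1, 10, 4.0],
                        [1, 11, np.nan],
                        [2, 10, 3.0],
                        [2, 11, 5.0]])

coverage, predictions = get_pred_coverage(data_matrix, NormalPredictor,
                                          gs.best_params['rmse'], verbose=True)
print(coverage)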

Python running extremely slow on one line of code

I'm running the code below.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

train = pd.read_csv('C:\\path_here\\train.csv')
test = pd.read_csv('C:\\path_here\\test.csv')

train['Type'] = 'Train'  # Create a flag for Train and Test data set
test['Type'] = 'Test'
fullData = pd.concat([train, test], axis=0)  # Combine both Train and Test data sets

fullData.columns      # This will show all the column names
fullData.head(10)     # Show first 10 records of dataframe
fullData.describe()   # You can look at a summary of the numerical fields with describe()

ID_col = ['REF_NO']
target_col = ['Status']
cat_cols = ['children','age_band','status','occupation','occupation_partner','home_status','family_income','self_employed', 'self_employed_partner','year_last_moved','TVarea','post_code','post_area','gender','region']
num_cols = list(set(list(fullData.columns)))
other_col = ['Type']  # Test and Train data set identifier

fullData.isnull().any()  # Returns True for each feature with missing values, else False

num_cat_cols = num_cols + cat_cols  # Combined numerical and categorical variables

# Create a new variable VariableName_NA for each variable having missing values,
# flagging missing values with 1 and others with 0
for var in num_cat_cols:
    if fullData[var].isnull().any() == True:
        fullData[var + '_NA'] = fullData[var].isnull() * 1

# Impute numerical missing values with the mean
fullData[num_cols] = fullData[num_cols].fillna(fullData[num_cols].mean(), inplace=True)

# Impute categorical missing values with 0
fullData[cat_cols] = fullData[cat_cols].fillna(value = 0)

# Create label encoders for the categorical features
for var in cat_cols:
    number = LabelEncoder()
    fullData[var] = number.fit_transform(fullData[var].astype('str'))

# The target variable is also categorical, so convert it
fullData["Account.Status"] = number.fit_transform(fullData["Account.Status"].astype('str'))

train = fullData[fullData['Type'] == 'Train']
test = fullData[fullData['Type'] == 'Test']
train['is_train'] = np.random.uniform(0, 1, len(train)) <= .75
Train, Validate = train[train['is_train'] == True], train[train['is_train'] == False]

features = list(set(list(fullData.columns)) - set(ID_col) - set(target_col) - set(other_col))

x_train = Train[list(features)].values
y_train = Train["Account.Status"].values
x_validate = Validate[list(features)].values
y_validate = Validate["Account.Status"].values
x_test = test[list(features)].values

random.seed(100)
rf = RandomForestClassifier(n_estimators=1000)
rf.fit(x_train, y_train)
It seems to run endlessly on this line:
fullData[cat_cols] = fullData[cat_cols].fillna(value = 0)
I can't get it past that spot. How can I see what's happening in the background? Is there some way to see the work that's being done? Thanks.
One way to check where the code is getting to is to add print statements. For example, you can add (right before the label encoder):
print("Code got before label encoder")
And then add another print statement after that code block. You can see in your console exactly where the code is getting stuck and debug that specific line.
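A slightly more informative variant is to time each block as well; a minimal sketch built on the question's own lines (the printed labels are just illustrative):

import time

start = time.time()
fullData[cat_cols] = fullData[cat_cols].fillna(value=0)
print(f"categorical fillna took {time.time() - start:.1f}s")

start = time.time()
for var in cat_cols:
    fullData[var] = LabelEncoder().fit_transform(fullData[var].astype('str'))
print(f"label encoding took {time.time() - start:.1f}s")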
