I work on a dataset which contains mainly binary variables. However, two of them are categorical with multiple values (strings). I want to apply feature selection using lasso, but I get the error KeyError: could not convert string to float:
Should I use LabelEncoder and then do the feature selection? Any ideas how to deal with this?
Here is my code:
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
scaler = MinMaxScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
selector = SelectFromModel(estimator=LassoCV(cv=5)).fit(X_scaled, y)
selector.get_support()
It is problematic to use one-hot encoding here, because each category will be coded as a separate binary column, and feeding those into lasso doesn't allow selecting the categorical variable as a whole, which is what you are after, I guess. You can also check out this post.
You can use the group lasso implementation in Python. Below I use an example dataset:
import pandas as pd
import numpy as np
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from group_lasso import GroupLasso
from group_lasso.utils import extract_ohe_groups
import scipy.sparse
data = pd.DataFrame({'cat1':np.random.choice(['A','B','C'],100),
'cat2':np.random.choice(['D','E','F'],100),
'bin1':np.random.choice([0,1],100),
'bin2':np.random.choice([0,1],100)})
data['y'] = 1.5*data['bin1'] + -3*data['bin2'] + 2*(data['cat1'] == 'A').astype('int') + np.random.normal(0,1,100)
Define the categorical and numeric (binary) columns. You don't need the MinMaxScaler since your values are binary. Next, we one-hot encode the categorical columns and extract the groups:
cat_columns = ['cat1','cat2']
num_columns = ['bin1','bin2']
ohe = OneHotEncoder()
onehot_data = ohe.fit_transform(data[cat_columns])
groups = extract_ohe_groups(ohe)
Put the numeric and one-hot parts together. You can also convert them to dense, but that can be problematic if the data is huge:
X = scipy.sparse.hstack([onehot_data,scipy.sparse.csr_matrix(data[num_columns])])
y = data['y']
Likewise, construct the groups:
groups = np.hstack([groups,len(cat_columns) + np.arange(len(num_columns))+1])
groups
Run the group lasso:
grpLasso = GroupLasso(groups=groups, supress_warning=True, n_iter=1000)
grpLasso.fit(X, y)
grpLasso.sparsity_mask_
array([ True, True, True, False, False, False, True, True])
grpLasso.chosen_groups_
{0, 3, 4}
Also check out the help page for using it in a pipeline.
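A rough sketch of that pipeline usage, following the pattern from the group-lasso documentation (the Ridge regressor at the end is just an illustrative choice):
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

# GroupLasso also implements transform(), so it can act as a
# variable-selection step in front of a downstream estimator.
pipe = Pipeline([
    ('variable_selection', GroupLasso(groups=groups, supress_warning=True, n_iter=1000)),
    ('regressor', Ridge()),
])
pipe.fit(X, y)
print(r2_score(y, pipe.predict(X)))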
Related
I am new to programming and I was working with the Titanic dataset from Kaggle. I have been trying to build a Logistic Regression model after performing one-hot encoding, but I keep getting an error. I think the error is caused by the dummy variables. Below is my code.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#Loading data
df=pd.read_csv(r"C:\Users\Downloads\train.csv")
#Deleting unwanted columns
df.drop(["PassengerId","Name","Cabin","Ticket"],axis=1,inplace=True)
#Count of missing values in each column
print(df.isnull().sum())
#Deleting rows with missing values based on column name
df.dropna(subset=['Embarked','Age'],inplace=True)
print(df.isnull().sum())
#One hot encoding for categorical variables
#Creating dummy variables for Sex column
dummies = pd.get_dummies(df.Sex)
dummies2=pd.get_dummies(df.Embarked)
#Appending the dummies dataframe with original dataframe
new_df= pd.concat([df,dummies,dummies2],axis='columns')
print(type(new_df))
#print(new_df.head(10))
#Drop the original Sex and Embarked columns and one of the dummy columns for both variables
new_df.drop(['Sex','Embarked'],axis='columns',inplace=True)
print(new_df.head(10))
new_df.info()
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix,accuracy_score
x = df.drop('Survived', axis=1)
y = df['Survived']
logmodel = LogisticRegression()
logmodel.fit(x, y)
As we discussed in the comments, here is the solution:
First, you need to modify your x and y variables to use new_df instead of df just like so:
x = new_df.drop('Survived', axis=1)
y = new_df['Survived']
Then, you need to increase the number of iterations of your Logistic Regression model like so:
logmodel = LogisticRegression(max_iter=1000)
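Putting both changes together, a minimal sketch of the corrected training step (the drop_first=True option is an extra that drops one dummy column per variable, as the comment in your code intended; it is not required for the fix itself):
from sklearn.linear_model import LogisticRegression

# One-hot encode and drop one dummy per variable to avoid redundant columns
dummies = pd.get_dummies(df[['Sex', 'Embarked']], drop_first=True)
new_df = pd.concat([df.drop(['Sex', 'Embarked'], axis='columns'), dummies], axis='columns')

# Train on the encoded frame, not the original df
x = new_df.drop('Survived', axis=1)
y = new_df['Survived']
logmodel = LogisticRegression(max_iter=1000)
logmodel.fit(x, y)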
I am using the LassoCV() model for feature selection. It gives me the warning below and does not select any features either.
"C:\Users\xyz\Anaconda3\lib\site-packages\sklearn\feature_selection\base.py:80: UserWarning: No features were selected: either the data is too noisy or the selection test too strict. UserWarning)"
The code is given below.
The data is in https://www.kaggle.com/jtrofe/beer-recipes/downloads/recipeData.csv/3
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
# dataset URL = https://www.kaggle.com/jtrofe/beer-recipes/downloads/recipeData.csv/3
dataframe = pd.read_csv('Brewer Friend Beer Recipes.csv', encoding = 'latin')
# Encoding the non numerical columns
def encoding_data(dataframe):
    if dataframe.dtype == 'object':
        return LabelEncoder().fit_transform(dataframe.astype(str))
    else:
        return dataframe

# Feature Selection using the selected Target Feature
def feature_selection(raw_dataframe, target_feature_list):
    output_list = []
    # preprocessing: converting categorical data into numeric data
    dataframe = raw_dataframe.apply(encoding_data)
    column_list = dataframe.columns.tolist()
    dataframe = dataframe.dropna()
    for target in target_feature_list:
        target_feature = target
        x = dataframe.drop(columns=[target_feature])
        y = dataframe[target_feature].values
        # Lasso feature selection
        estimator = LassoCV(cv=3, n_alphas=1)
        featureselection = SelectFromModel(estimator)
        featureselection.fit(x, y)
        features = featureselection.transform(x)
        feature_list = x.columns[featureselection.get_support()]
        features = ', '.join(feature_list)
        l = (target, features)
        output_list.append(l)
    output_df = pd.DataFrame(output_list, columns=['Name', 'Selected Features'])
    print('\nThe Feature Selection is done with the respective target feature(s)')
    return output_df

print(feature_selection(dataframe, ['BrewMethod']))
I am getting this warning and no features are selected.
"C:\Users\xyz\Anaconda3\lib\site-packages\sklearn\feature_selection\base.py:80: UserWarning: No features were selected: either the data is too noisy or the selection test too strict. UserWarning)"
Any idea how to rectify this?
If no features have been selected, you can gradually decrease lambda (or, in scikit-learn's case, alpha). This reduces the penalization and will probably return some nonzero coefficients.
It is extremely unusual that no coefficients at all were selected. You should think about checking the correlations in your data; maybe you have a lot of collinearity.
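As a minimal sketch of that idea (the alpha values below are arbitrary; x and y are the matrices from your feature_selection loop):
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# Try progressively weaker penalties until some coefficients survive
for alpha in [1.0, 0.1, 0.01, 0.001]:
    selector = SelectFromModel(Lasso(alpha=alpha, max_iter=10000)).fit(x, y)
    n_selected = selector.get_support().sum()
    print(alpha, n_selected)
    if n_selected > 0:
        break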
I'm running the code below.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
train=pd.read_csv('C:\\path_here\\train.csv')
test=pd.read_csv('C:\\path_here\\test.csv')
train['Type']='Train' #Create a flag for Train and Test Data set
test['Type']='Test'
fullData = pd.concat([train,test],axis=0) #Combined both Train and Test Data set
fullData.columns # This will show all the column names
fullData.head(10) # Show first 10 records of dataframe
fullData.describe() #You can look at summary of numerical fields by using describe() function
ID_col = ['REF_NO']
target_col = ['Status']
cat_cols = ['children','age_band','status','occupation','occupation_partner','home_status','family_income','self_employed', 'self_employed_partner','year_last_moved','TVarea','post_code','post_area','gender','region']
num_cols= list(set(list(fullData.columns)))
other_col=['Type'] #Test and Train Data set identifier
fullData.isnull().any()#Will return the feature with True or False,True means have missing value else False
num_cat_cols = num_cols+cat_cols # Combined numerical and Categorical variables
#Create a new variable for each variable having missing value with VariableName_NA
# and flag missing value with 1 and other with 0
for var in num_cat_cols:
    if fullData[var].isnull().any() == True:
        fullData[var+'_NA'] = fullData[var].isnull()*1
#Impute numerical missing values with mean
fullData[num_cols] = fullData[num_cols].fillna(fullData[num_cols].mean())
#Impute categorical missing values with 0
fullData[cat_cols] = fullData[cat_cols].fillna(value = 0)
#create label encoders for categorical features
for var in cat_cols:
    number = LabelEncoder()
    fullData[var] = number.fit_transform(fullData[var].astype('str'))
#Target variable is also a categorical so convert it
fullData["Account.Status"] = number.fit_transform(fullData["Account.Status"].astype('str'))
train=fullData[fullData['Type']=='Train']
test=fullData[fullData['Type']=='Test']
train['is_train'] = np.random.uniform(0, 1, len(train)) <= .75
Train, Validate = train[train['is_train']==True], train[train['is_train']==False]
features=list(set(list(fullData.columns))-set(ID_col)-set(target_col)-set(other_col))
x_train = Train[list(features)].values
y_train = Train["Account.Status"].values
x_validate = Validate[list(features)].values
y_validate = Validate["Account.Status"].values
x_test=test[list(features)].values
random.seed(100)
rf = RandomForestClassifier(n_estimators=1000)
rf.fit(x_train, y_train)
It seems to run endlessly on this line:
fullData[cat_cols] = fullData[cat_cols].fillna(value = 0)
I can't get it past that spot. How can I see what's happening in the background? Is there some way to see the work that's being done? Thanks.
One way to check where the code is getting to is to add print statements. For example, you can add (right before the label encoder):
print("Code got before label encoder")
And then, after that code block, add another print statement. You can see in your console exactly where the code gets stuck and debug that specific line.
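A minimal sketch of that idea applied to the suspect part of your script (the messages are just illustrative; num_cols, cat_cols and fullData come from your code):
import time

print("before numeric fillna", time.time())
fullData[num_cols] = fullData[num_cols].fillna(fullData[num_cols].mean())
print("before categorical fillna", time.time())
fullData[cat_cols] = fullData[cat_cols].fillna(value=0)
print("before label encoding", time.time())
for var in cat_cols:
    fullData[var] = LabelEncoder().fit_transform(fullData[var].astype('str'))
print("done encoding", time.time())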
Let's say I have a pandas data frame and I want to normalize only some attributes, not the whole data frame, with the help of this function:
preprocessing.normalize
And I want to put these normalized columns back into my data frame, but I can't, because they come back in a different format (numpy array).
I have already seen how to do the normalization in other ways; for example, I did it like this:
s0 = X.iloc[:,13:15]
X.iloc[:,13:15] = (s0 - s0.mean()) / (s0.max() - s0.min())
X.head()
But I really need to do it using sklearn.
Thanks, Stack!
What you are doing is min-max scaling. "normalize" in scikit-learn has a different meaning than what you want to do.
Try MinMaxScaler.
Most sklearn transformers output numpy arrays only. For a dataframe, you can simply re-assign the columns, as in the example below:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['A', 'B', 'C'])
Now let's say you only want to min-max scale columns A and C:
from sklearn.preprocessing import MinMaxScaler
minmax = MinMaxScaler()
df[['A', 'C']] = minmax.fit_transform(df[['A', 'C']])
(s0 - s0.mean()) / (s0.max() - s0.min()) is called mean normalization and, as far as I am aware, there is no transformer in scikit-learn to carry out this transformation.
The MinMaxScaler transforms following this formula: (s0 - s0.min()) / (s0.max() - s0.min())
You can do this transformation on selected variables with scikit-learn as follows:
The dirty way:
scaler = MinMaxScaler() # or any other scaler from sklearn
scaler.fit(X[[var1, var2, var20]])
X_transf[[var1, var2, var20]] = scaler.transform(X[[var1, var2, var20]])
A better way, using the ColumnTransformer:
features_numerical = [var1, var2, var20]
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
preprocessor = ColumnTransformer(
    transformers=[('numerical', numeric_transformer, features_numerical)],
    remainder='passthrough')  # to keep all other features in the data set
preprocessor.fit_transform(X)
The returned variable is a numpy array, so it needs re-casting into a pandas dataframe and the addition of variable names.
More information on how to use column transformer from sklearn here.
You need to import the ColumnTransformer and the Pipeline from sklearn, as well as the scaler of choice.
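For completeness, a minimal sketch of those imports and of re-casting the result back into a dataframe (var1, var2, var20 are placeholder column names from the snippet above, and X is assumed to be a pandas dataframe):
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

features_numerical = ['var1', 'var2', 'var20']
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
preprocessor = ColumnTransformer(
    transformers=[('numerical', numeric_transformer, features_numerical)],
    remainder='passthrough')

# fit_transform returns a numpy array; with remainder='passthrough' the scaled
# columns come first, followed by the remaining columns in their original order
remaining = [c for c in X.columns if c not in features_numerical]
X_transf = pd.DataFrame(preprocessor.fit_transform(X),
                        columns=features_numerical + remaining)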
I am trying to perform a speed comparison test between Python and R and am struggling with an issue: LinearRegression under sklearn with categorical variables.
Code R:
# Start the clock!
ptm <- proc.time()
ptm
test_data = read.csv("clean_hold.out.csv")
# Regression Model
model_liner = lm(test_data$HH_F ~ ., data = test_data)
# Stop the clock
new_ptm <- proc.time() - ptm
Code Python:
import pandas as pd
import time
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction import DictVectorizer
start = time.time()
test_data = pd.read_csv("./clean_hold.out.csv")
x_train = [col for col in test_data.columns[1:] if col != 'HH_F']
y_train = ['HH_F']
model_linear = LinearRegression(normalize=False)
model_linear.fit(test_data[x_train], test_data[y_train])
but it does not work for me:
return X.astype(np.float32 if X.dtype == np.int32 else np.float64)
ValueError: could not convert string to float: Bee True
I tried another approach:
test_data = pd.read_csv("./clean_hold.out.csv").to_dict()
v = DictVectorizer(sparse=False)
X = v.fit_transform(test_data)
However, I caught another error:
File "C:\Anaconda32\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 258, in transform
    Xa[i, vocab[f]] = dtype(v)
TypeError: float() argument must be a string or a number
I don't understand how I should resolve this issue in Python ...
Example of data:
http://screencast.com/t/hYyyu7nU9hQm
You have to do some encoding before using fit. There are several classes that can be used:
LabelEncoder: turns your string values into incremental integers
OneHotEncoder: uses a one-of-K scheme to transform your strings into binary columns
I wanted a scalable solution but didn't get any answer, so I selected OneHotEncoder, which binarizes all the strings. It is quite effective, but if you have a lot of different strings the matrix will grow very quickly and a lot of memory will be required.
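A minimal sketch of the OneHotEncoder route for this case, assuming a recent scikit-learn where OneHotEncoder accepts string columns directly and using ColumnTransformer to leave the numeric columns untouched (the column handling is illustrative, since I don't know the actual contents of clean_hold.out.csv):
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

test_data = pd.read_csv("./clean_hold.out.csv")
y = test_data['HH_F']
X = test_data.drop(columns=['HH_F'])

# One-hot encode the string columns, pass the numeric ones through unchanged
cat_cols = X.select_dtypes(include='object').columns.tolist()
encoder = ColumnTransformer([('ohe', OneHotEncoder(handle_unknown='ignore'), cat_cols)],
                            remainder='passthrough')

model = LinearRegression()
model.fit(encoder.fit_transform(X), y)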