Use Python GridSearchCV to compare imputer methods?

I'm working on preprocessing the Titanic data set in order to run it through some regressions.
The "Age" column in both the train and test sets is populated for only around 80% of the rows.
Rather than just drop the rows that are missing an "Age", I'd like to use SimpleImputer (from sklearn.impute import SimpleImputer) to fill in the missing values in that column.
SimpleImputer has three options for its 'strategy' parameter that work with numeric data: 'mean', 'median', and 'most_frequent' (mode). (There's also the option to fill with a custom constant value, but because I'm trying to avoid "binning" the values I don't want to use that option.)
At its most basic, my approach would involve manually setting up the required datasets. I'd have to run one of each kind of imputer (imputer = SimpleImputer(strategy="xxxxxx") where xxxxxx is 'mean', 'median', or 'most_frequent') on each of the train and test datasets, and would then end up with six different datasets to feed through my RandomForestRegressor one at a time.
I know that GridSearchCV can be used to exhaustively compare various combinations of parameter values in a regressor, so I'm wondering if anyone knows a way to use it (or something similar) to run through the various 'strategy' options of the imputer.
I'm thinking of something along the lines of the following pseudocode:
param_grid = [
    {'strategy': ['mean', 'median', 'most_frequent']},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(titanic_features[strategy], titanic_values[strategy])
Is there a clean way to compare options like this?
Is there a better way to compare the three options than to build all six data sets, run them through the RF regressor and see what comes out?

Sklearn Pipelines are exactly meant for this. You have to create a pipeline with an imputer component preceding the regressor. You can then use a grid search parameter grid with the __ separator to pass component-specific parameters.
Sample code (documented inline)
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Sample/synthetic data of shape 1000 x 2
X = np.random.randn(1000, 2)
y = 1.5*X[:, 0] + 3.2*X[:, 1] + 2.4
# Randomly set 200 data points in each column to nan
X[np.random.randint(0, 1000, 200), 0] = np.nan
X[np.random.randint(0, 1000, 200), 1] = np.nan
# Simple pipeline which has an imputer followed by a regressor
pipe = Pipeline(steps=[('impute', SimpleImputer(missing_values=np.nan)),
                       ('regressor', RandomForestRegressor())])
# 3 different imputer strategies and 2 different regressor depths:
# a total of 6 parameter combinations will be searched
param_grid = {
    'impute__strategy': ["mean", "median", "most_frequent"],
    'regressor__max_depth': [2, 3]
}
# Run grid search
search = GridSearchCV(pipe, param_grid)
search.fit(X, y)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
Sample output:
Best parameter (CV score=0.730):
{'impute__strategy': 'median', 'regressor__max_depth': 3}
So with GridSearchCV we are able to find that the best impute strategy for our sample data is 'median', in combination with a max_depth of 3.
You can keep extending the pipeline with other components.
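For example, here is a minimal sketch (not from the original answer) of adding a scaling step; the StandardScaler step and the with_mean values are purely illustrative:
from sklearn.preprocessing import StandardScaler

# Same pipeline as above with an extra, illustrative scaling step
pipe = Pipeline(steps=[('impute', SimpleImputer(missing_values=np.nan)),
                       ('scale', StandardScaler()),
                       ('regressor', RandomForestRegressor())])
# step name + '__' + parameter name addresses each component's parameters
param_grid = {
    'impute__strategy': ["mean", "median", "most_frequent"],
    'scale__with_mean': [True, False],
    'regressor__max_depth': [2, 3],
}
search = GridSearchCV(pipe, param_grid)
search.fit(X, y)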

Related

How to get feature names when using onehot encoder on only certain columns sklearn

I have read many posts on this that reference get_feature_names() from sklearn, which appears to be deprecated and replaced by get_feature_names_out(), neither of which I can get to work. It also appears that there is no way to use get_feature_names (or get_feature_names_out) with the ColumnTransformer class. So I am trying to fit and transform my numeric columns with a SimpleImputer and then a StandardScaler, then SimpleImputer('most_frequent') and OneHotEncoder for the categorical variables. I run them all individually since I can't put them in a pipeline; then I try get_feature_names and this is the result:
ValueError: input_features should have length equal to number of features (5), got 11
I have also tried getting feature names for just the categorical features, as well as just the numeric, and each one gives the following errors respectively:
ValueError: input_features should have length equal to number of features (5), got 121942
and
ValueError: input_features should have length equal to number of features (5), got 121942
I am completely lost and also open to an easier way to get the feature names so that I can make sure the prod data that I run this model on after training/testing has the exact same features as the ones the model is trained to expect (which is the root issue here).
If I'm "barking up the wrong tree" by trying to get the feature names for the reasoning outlined in the root issue I'm also more than willing to be corrected. Here is my code:
#ONE HOT
import sklearn
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# !pip install -U scikit-learn
print('The scikit-learn version is {}.'.format(sklearn.__version__))
numeric_columns = X.select_dtypes(include=['int64','float64']).columns
cat_columns = X.select_dtypes(include=['object']).columns
si_num = SimpleImputer(strategy='median')
si_cat = SimpleImputer(strategy='most_frequent')
ss = StandardScaler()
ohe = OneHotEncoder()
si_num.fit_transform(X[numeric_columns])
si_cat.fit_transform(X[cat_columns])
ss.fit_transform(X[numeric_columns])
ohe.fit_transform(X[cat_columns])
ohe.get_feature_names(X[numeric_columns])
Thanks!
I think this should work as a single composite estimator that does all your transformations and provides get_feature_names_out:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

num_pipe = Pipeline([
    ("imp", si_num),
    ("scale", ss),
])

cat_pipe = Pipeline([
    ("imp", si_cat),
    ("ohe", ohe),
])

preproc = ColumnTransformer([
    ("num", num_pipe, numeric_columns),
    ("cat", cat_pipe, cat_columns),
])
Ideally, you should save the fitted composite and use that to transform production data, rather than using the feature names to reconcile different categories.
You should also fit this composite only on the training set, transforming the test set separately.
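As a rough usage sketch (assuming X_train and X_test are your train/test DataFrames, and a scikit-learn version >= 1.0 where ColumnTransformer exposes get_feature_names_out):
# Fit the composite on the training split only, then reuse it for test/prod data.
Xt_train = preproc.fit_transform(X_train)
Xt_test = preproc.transform(X_test)
# Output column names: the scaled numeric columns plus the one-hot categories.
print(preproc.get_feature_names_out())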

How to perform cross validation stratifying by a feature while also performing feature selection?

Let's say I have the following data (it doesn't make sense in terms of classification, it's just to illustrate the case):
import pandas as pd
import numpy as np
size = 1000
data = pd.DataFrame()
data['group'] = np.random.choice(['a', 'b', 'c'], size=size)
data['feature_a'] = np.random.binomial(n=1, p=0.05, size=size)
data['feature_b'] = np.random.randn(size)
data['label'] = np.random.binomial(n=1, p=0.3, size=size)
Now, let's say I want to perform cross validation on this dataset to build a LogisticRegression model, using sklearn.
I want the training set sample to be stratified according to the column/feature group. So it would make sense to me to use the StratifiedKFold cv method, passing the y parameter during the .split() stage as data['group'].
That would be possible by using a loop and enumerating the generated splits. But let's say I also want to perform feature selection, using, for example, sklearn.feature_selection.RFECV. It has a cv parameter, where I can specify the CV method. However, it assumes I'll stratify by the label column.
So how can I achieve this "pipeline"? Feature selection using stratified cross validation, but stratifying by a feature, not the label/class column.
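One possible approach (a sketch under the assumption that stratifying the folds by 'group' is really what you want, not an authoritative answer): RFECV's cv parameter also accepts an iterable of (train, test) index pairs, so you can precompute splits stratified by the group column and hand them to the selector:
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X = data[['feature_a', 'feature_b']]
y = data['label']

# Build folds stratified by the 'group' column rather than by the label.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
splits = list(skf.split(X, data['group']))

# Pass the precomputed (train, test) index pairs to RFECV via its cv parameter.
selector = RFECV(LogisticRegression(), cv=splits, scoring='accuracy')
selector.fit(X, y)
print(selector.support_)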

Isolation Forest Sklearn for 1D array or list and how to tune hyper parameters

Is there a way to implement sklearn isolation forest for a 1D array or list? All the examples I came across are for data of 2 Dimension or more.
I have right now developed a model with three features, and the example code snippet is below:
# dataframe of three columns
df_data = datafr[['col_A', 'col_B', 'col_C']]
w_train = df_data[:700]
w_test = df_data[700:-2]
from sklearn.ensemble import IsolationForest
# fit the model
clf = IsolationForest(max_samples='auto')
clf.fit(w_train)
#testing it using test set
y_pred_test = clf.predict(w_test)
The reference I mainly relied upon: IsolationForest example | scikit-learn
The df_data is a data frame with three columns. I am actually looking to find outlier in 1 Dimension or list data.
The other question is how to tune an isolation forest model. One of the ways is to increase the contamination value to reduce the false positives. But how should the other parameters be used, like n_estimators, max_samples, max_features, verbose, etc.?
It won't make sense to apply an Isolation Forest to a 1D array or list, because in that case it would simply be a one-to-one mapping from feature to target.
You can read the official documentation to get a better idea of how the different parameters help.
contamination
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
Try experimenting with different values in the range [0, 0.5] to see which one gives the best results.
max_features
The number of features to draw from X to train each base estimator.
Try values like 5, 6, 10, etc. (any int of your choice) and validate with the final test data.
n_estimators
Try multiple values like 10, 20, 50, etc. to see which works best.
You can also use GridSearchCV to automate this process of parameter estimation.
Just try experimenting with different values using GridSearchCV and see which one gives the best results.
Try this
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer

my_scoring_func = make_scorer(f1_score)
parameters = {'n_estimators': [10, 30, 50, 80],
              'max_features': [0.1, 0.2, 0.3, 0.4],
              'contamination': [0.1, 0.2, 0.3]}
iso_for = IsolationForest(max_samples='auto')
clf = GridSearchCV(iso_for, parameters, scoring=my_scoring_func)
Then use clf to fit the data, as in the sketch below. Note that GridSearchCV requires both X and y (i.e. train data and labels) for the fit method.
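A hypothetical fit call might look like this, assuming X is the feature matrix and y holds known labels encoded as 1 (inlier) / -1 (outlier) so that they match IsolationForest.predict's output for the f1 scorer:
# y must use the same -1/1 encoding that IsolationForest.predict returns.
clf.fit(X, y)
print(clf.best_params_)
print(clf.best_score_)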
Note: you can read this blog post for further reference if you wish to use GridSearchCV with Isolation Forest; otherwise, you can manually try different values and plot graphs to see the results.

What does this error mean with StratifiedShuffleSplit?

I'm totally new to Data Science in general and was hoping someone could explain why this does not work:
I'm using the Advertising dataset from the following url: "http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv" which has 3 feature columns ("TV", "Radio", "Newspaper") and 1 label column ("sales"). My complete dataset is named data.
Next, I try to use sklearn's StratifiedShuffleSplit function to divide the data into training and testing sets.
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, random_state=0)  # can use test_size=0.8
# Generate indices to split data into training and test set.
for train_index, test_index in split.split(data.drop("sales", axis=1), data["sales"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]
I get this ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
Using the same code on another dataset which has 14 feature columns and 1 label column separates the data appropriately. Why doesn't it work here? Thanks.
I think the problem is that your data_y is a 2D matrix, but as the sklearn.model_selection.StratifiedShuffleSplit doc describes, it should be a 1D vector. Try encoding each row of data_y as an integer (it will be interpreted as a class label), and then use split.
Or possibly your y is a regression variable (continuous numerical data), in which case almost every value forms its own one-member "class" and stratification fails with exactly this error.
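If the target really is continuous, one common workaround (a sketch, not part of the original answer) is to bin it into a small number of discrete categories purely for stratification:
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Bin the continuous 'sales' column into 5 integer-coded categories.
data["sales_cat"] = pd.cut(data["sales"], bins=5, labels=False)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
for train_index, test_index in split.split(data.drop(["sales", "sales_cat"], axis=1), data["sales_cat"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]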

Using ranking data in Logistic Regression

I will be putting the max bounty on this as I am struggling to learn these concepts! I am trying to use some ranking data in a logistic regression. I want to use machine learning to make a simple classifier as to whether a webpage is "good" or not. It's just a learning exercise so I don't expect great results; just hoping to learn the "process" and coding techniques.
I have put my data in a .csv as follows :
URL WebsiteText AlexaRank GooglePageRank
In my Test CSV we have :
URL WebsiteText AlexaRank GooglePageRank Label
Label is a binary classification indicating "good" with 1 or "bad" with 0.
I currently have my LR running using only the website text, which I run a TF-IDF on.
I have two questions which I need help with. I'll be putting a max bounty on this question and awarding it to the best answer, as this is something I'd like some good help with so I, and others, may learn.
How can I normalize my ranking data for AlexaRank? I have a set of 10,000 webpages, for which I have the Alexa rank of all of them; however they aren't ranked 1-10,000. They are ranked out of the entire Internet, so while http://www.google.com may be ranked #1, http://www.notasite.com may be ranked #83904803289480. How do I normalize this in scikit-learn in order to get the best possible results from my data?
I am running my Logistic Regression in this way; I am nearly sure I have done this incorrectly. I am trying to do the TF-IDF on the website text, then add the two other relevant columns and fit the Logistic Regression. I'd appreciate if someone could quickly verify that I am taking in the three columns I want to use in my LR correctly. Any and all feedback on how I can improve myself would also be appreciated here.
loadData = lambda f: np.genfromtxt(open(f,'r'), delimiter=' ')
print "loading data.."
traindata = list(np.array(p.read_table('train.tsv'))[:,2])#Reading WebsiteText column for TF-IDF.
testdata = list(np.array(p.read_table('test.tsv'))[:,2])
y = np.array(p.read_table('train.tsv'))[:,-1] #reading label
tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode', analyzer='word',
                      token_pattern=r'\w{1,}', ngram_range=(1, 2), use_idf=1, smooth_idf=1, sublinear_tf=1)
rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001, C=1, fit_intercept=True, intercept_scaling=1.0, class_weight=None, random_state=None)
X_all = traindata + testdata
lentrain = len(traindata)
print "fitting pipeline"
tfv.fit(X_all)
print "transforming data"
X_all = tfv.transform(X_all)
X = X_all[:lentrain]
X_test = X_all[lentrain:]
print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=20, scoring='roc_auc'))
#Add Two Integer Columns
AlexaAndGoogleTrainData = list(np.array(p.read_table('train.tsv'))[2:,3])#Not sure if I am doing this correctly. Expecting it to contain AlexaRank and GooglePageRank columns.
AlexaAndGoogleTestData = list(np.array(p.read_table('test.tsv'))[2:,3])
AllAlexaAndGoogleInfo = AlexaAndGoogleTestData + AlexaAndGoogleTrainData
#Add two columns to X.
X = np.append(X, AllAlexaAndGoogleInfo, 1) #Think I have done this incorrectly.
print "training on full data"
rd.fit(X,y)
pred = rd.predict_proba(X_test)[:,1]
testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
pred_df.to_csv('benchmark.csv')
print "submission file created.."`
Thank you very much for all feedback - please post if you need any further information!
I guess sklearn.preprocessing.StandardScaler would be the first thing you want to try. StandardScaler transforms all of your features into Mean-0-Std-1 features.
This definitely gets rid of your first problem. AlexaRank will be guaranteed to be spread around 0 and bounded. (Yes, even massive AlexaRank values like 83904803289480 are transformed to small floating point numbers.) Of course, the results will not be integers between 1 and 10,000, but they will maintain the same order as the original ranks. And in this case, keeping the rank bounded and normalized will help solve your second problem, as follows.
In order to understand why normalization helps in LR, let's revisit the logit formulation of LR: logit(p) = log(p / (1 - p)) = b0 + b1*X1 + b2*X2 + b3*X3 + b4*X4 + b5*X5.
In your case, X1, X2, X3 are three TF-IDF features and X4, X5 are the Alexa/Google rank related features. Now, the linear form of the equation suggests that the coefficients represent the change in the logit of y with one unit change in a variable. Think about what happens when your X4 is kept fixed at a massive rank value, say 83904803289480. In that case, the Alexa rank variable dominates your LR fit and a small change in a TF-IDF value has almost no effect on the fit. Now one might think that the coefficients should be able to adjust to small/large values to account for differences between these features. Not in this case: it's not only the magnitude of the variables that matters but also their range. Alexa rank definitely has a large range and will dominate your LR fit in this case. Therefore, I guess normalizing all variables using StandardScaler to adjust their range will improve the fit.
Here is how you can scale the X matrix.
from sklearn import preprocessing
sc = preprocessing.StandardScaler().fit(X)
X = sc.transform(X)
Don't forget to use the same scaler to transform X_test.
X_test = sc.transform(X_test)
Now you can use the fitting procedure etc.
rd.fit(X, y)
rd.predict_proba(X_test)
Check this out for more on sklearn preprocessing: http://scikit-learn.org/stable/modules/preprocessing.html
Edit: The parsing and column merging can be done easily using pandas, i.e., there is no need to convert the matrices into lists and then append them. Moreover, pandas dataframes can be directly indexed by their column names.
AlexaAndGoogleTrainData = p.read_table('train.tsv', header=0)[["AlexaRank", "GooglePageRank"]]
AlexaAndGoogleTestData = p.read_table('test.tsv', header=0)[["AlexaRank", "GooglePageRank"]]
AllAlexaAndGoogleInfo = AlexaAndGoogleTestData.append(AlexaAndGoogleTrainData)
Note that we are passing the header=0 argument to read_table to maintain the original header names from the tsv file. Also note how we can index using an entire set of columns. Finally, you can stack this new matrix with X using numpy.hstack.
X = np.hstack((X, AllAlexaAndGoogleInfo))
hstack horizontally combines two multi-dimensional array-like structures, provided their lengths are the same.
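A tiny illustration of what np.hstack does here:
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5],
              [6]])

np.hstack((a, b))
# array([[1, 2, 5],
#        [3, 4, 6]])
# (If X is a scipy sparse matrix, e.g. the default TfidfVectorizer output,
# scipy.sparse.hstack is the analogous call.)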
Regarding normalizing the numeric ranks, either scikit-learn's StandardScaler or a logarithmic transform (or both) should work well enough.
For building up a working pipeline, I find my sanity greatly benefits from using the Pandas package and the sklearn.pipeline utilities. Here is a simple script that should do what you need.
First, a couple of utility classes I always seem to need. It would be nice to have something like these in sklearn.pipeline or sklearn.utilities.
from sklearn import base

class Columns(base.TransformerMixin, base.BaseEstimator):
    def __init__(self, columns):
        super(Columns, self).__init__()
        self.columns_ = columns

    def fit(self, *args, **kwargs):
        return self

    def transform(self, X, *args, **kwargs):
        return X[self.columns_]

class Text(base.TransformerMixin, base.BaseEstimator):
    def fit(self, *args, **kwargs):
        return self

    def transform(self, X, *args, **kwargs):
        return X.apply("\t".join, axis=1, raw=False)
Now set up the pipeline.
I used the SGDClassifier implementation of logistic regression, since it tends to be more efficient for high-dimensional data like text classification; I also usually find that hinge loss gives better results than logistic regression anyway.
from sklearn import linear_model as lin
from sklearn import metrics
from sklearn.feature_extraction import text as txt
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing as prep
import numpy as np
from pandas.io import parsers
import pandas as pd
pipe = Pipeline([
    ('feat', FeatureUnion([
        ('txt', Pipeline([
            ('txtcols', Columns(["WebsiteText"])),
            ('totxt', Text()),
            ('vect', txt.TfidfVectorizer()),
        ])),
        ('num', Pipeline([
            ('numcols', Columns(["AlexaRank", "GooglePageRank"])),
            ('scale', prep.StandardScaler()),
        ])),
    ])),
    ('clf', lin.SGDClassifier(loss="log")),
])
Next train the model:
train=parsers.read_csv("train.csv")
pipe.fit(train, train.Label)
Finally evaluate on test data:
test=parsers.read_csv("test.csv")
tstlbl=np.array(test.Label)
print pipe.score(test, tstlbl)
pred = pipe.predict(test)
print metrics.confusion_matrix(tstlbl, pred)
print metrics.classification_report(tstlbl, pred)
print metrics.f1_score(tstlbl, pred)
prob = pipe.decision_function(test)
print metrics.roc_auc_score(tstlbl, prob)
print metrics.average_precision_score(tstlbl, prob)
You will probably not get very good results with everything using default settings like this, but it should give you a working baseline to work from. I can suggest some parameter settings that usually work for me if you like.
