Feature extraction per fold with Scikit - python

For a machine learning experiment I have to perform feature selection. I have no split into training and test sets because I use 10-fold cross-validation. Someone told me that I have to do the feature selection per fold, but I have no idea how to do that. Here is part of my code.
vec = DictVectorizer()
X = vec.fit_transform(instances) # No train/ test set, because we'll use 10-fold cross validation
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X) # To make sure everything is on the same scale
enc = LabelEncoder()
y = enc.fit_transform(labels)
#feature selection
from sklearn.feature_selection import SelectKBest, mutual_info_classif
feat_sel = SelectKBest(mutual_info_classif, k=200)
X_fs = feat_sel.fit_transform(X_scaled, y)
#train a classifier
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
y_pred = model_selection.cross_val_predict(clf, X_fs, y, cv=10)
Can someone help me with the selection per fold?

Answering the second question that you posted.
You can use cross validation and see the results:
Do:
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFECV
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np
feat_sel = SelectKBest(mutual_info_classif, k=200)
clf = MultinomialNB()
pipe = Pipeline([('mutual_info',feat_sel), ('naive_bayes',clf)])
scores = cross_val_score(pipe, X_scaled, y, cv =10, scoring = 'accuracy')
print(np.mean(scores))

You can use a Pipeline: join the feature selector and the classifier into a pipeline and cross-validate the pipeline, so that the selector is refit on the training part of every fold.
Reference: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
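To make "per fold" concrete, here is a rough sketch of roughly what cross-validating such a pipeline amounts to when written out by hand: the selector is fit on the training part of each fold only, then applied to the held-out part. The synthetic data (X_demo, y_demo), k=10 and the np.abs shortcut are assumptions for illustration, not the asker's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-in for X_scaled and y (assumption: not the asker's data).
X_raw, y_demo = make_classification(
    n_samples=200, n_features=30, n_informative=5, random_state=0
)
# MultinomialNB needs non-negative features, so take absolute values
# (a shortcut that only makes sense for this toy data).
X_demo = np.abs(X_raw)

fold_scores = []
cv = StratifiedKFold(n_splits=10)
for train_idx, test_idx in cv.split(X_demo, y_demo):
    # Fit the selector on the training part of this fold only.
    selector = SelectKBest(mutual_info_classif, k=10)
    X_train_fs = selector.fit_transform(X_demo[train_idx], y_demo[train_idx])
    # Reuse the fitted selector on the held-out part; no refitting here.
    X_test_fs = selector.transform(X_demo[test_idx])

    clf = MultinomialNB().fit(X_train_fs, y_demo[train_idx])
    fold_scores.append(clf.score(X_test_fs, y_demo[test_idx]))

print(np.mean(fold_scores))
```

The Pipeline version does the same bookkeeping for you, which is why it is usually the less error-prone choice.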

Related

How do you use GridSearchCV to search values for alpha in lasso and ridge regression models?

I'm doing an exercise on using sklearn Pipelines and GridSearchCV to find values for alpha in lasso and ridge regression models, where we also use SimpleImputer to take care of some missing values.
However, I keep getting the following traceback error:
ValueError: Invalid parameter alpha for estimator Pipeline
However, I'm certain that Sklearn's Ridge and Lasso both have the alpha parameter and have double checked on the sklearn docs.
Where am I going wrong? My code is below:
import numpy as np
import pandas as pd
from sklearn import svm, datasets
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn import metrics
columns = ["sex","length","diam","height","whole","shucked","viscera","shell","age"]
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data",names=columns)
y = df.age
X=df.drop(columns=['age'])
num_cols = X.select_dtypes(include=np.number).columns
cat_cols = X.select_dtypes(include=['object']).columns
#create some missing values
for i in range(1000):
    X.loc[np.random.choice(X.index),np.random.choice(X.columns)] = np.nan
x_train, x_test, y_train, y_test = train_test_split(X,y, random_state=0, test_size=0.25)
cat_vals = Pipeline([("imputer",SimpleImputer(strategy='most_frequent')), ("ohe",OneHotEncoder(sparse=False, drop='first'))])
num_vals = Pipeline([("imputer",SimpleImputer(strategy='mean')), ("scale",StandardScaler())])
preprocess = ColumnTransformer(
transformers=[
("cat_process", cat_vals, cat_cols),
("num_process", num_vals, num_cols)
]
)
#Create a pipeline with preprocess and a linear regression model
pipeline = Pipeline([("preprocess",preprocess),
("regr",LinearRegression())])
#simple parameter grid, with and without the intercept
param_grid = {
'regr': [LinearRegression()],
"regr__fit_intercept": [True,False]
}
# 1. Update the dictionary to include values for alpha in lasso and ridge regression models. Use np.logspace(-4,2,10).
param_grid['alpha'] = np.logspace(-4,2,10)
param_grid['regr'] = [Lasso(), Ridge()]
# 2. Fit the GridSearchCV on the training data and print the best estimator and score from the search.
gscv = GridSearchCV(pipeline, param_grid, scoring='neg_mean_squared_error', cv=5)
gscv.fit(x_train, y_train)
print(gscv.best_params_)
print(gscv.best_score_)
print(gscv.best_estimator_)
The parameter name should be given a prefix according to the pipeline stage name, i.e. regr__alpha - just like you did for the fit_intercept.
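As a minimal sketch of that fix, the grid below uses the regr__ prefix for both fit_intercept and alpha. The synthetic data from make_regression, the single StandardScaler step, and max_iter=10000 for Lasso are assumptions made to keep the example self-contained; they are not part of the original exercise.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption: stands in for the abalone data).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=0
)

# The step name "regr" is what the parameter prefix refers to.
pipeline = Pipeline([("scale", StandardScaler()), ("regr", LinearRegression())])

param_grid = {
    # Swap the whole step for Lasso or Ridge, both of which accept alpha.
    "regr": [Lasso(max_iter=10000), Ridge()],
    "regr__fit_intercept": [True, False],
    # Prefix with the step name: "regr__alpha", not "alpha".
    "regr__alpha": np.logspace(-4, 2, 10),
}

gscv = GridSearchCV(pipeline, param_grid, scoring="neg_mean_squared_error", cv=5)
gscv.fit(X_demo, y_demo)
print(gscv.best_params_)
print(gscv.best_score_)
```

Note that LinearRegression itself has no alpha, so it is left out of the "regr" candidates here; if you want to compare against it, a list of two separate grids is one way to do that.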

How to feed data into random forest classifier and see prediction

I have built a random forest classifier using scikit-learn and Python, and I am having trouble actually feeding data in to see the prediction. I want to see the format of the output and convert it to a JSON file. I have attached the code for the random forest and what the data looks like. I believe I need to use 'y_pred', but I am not sure what format the input data needs to be in.
X = dataset.iloc[:, 2:4].values
y = dataset["pages"]
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=20, random_state = 0)
classifier = classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
You can simply concatenate the predicted values with the matrix of features.
Also note that a pipeline exists for exactly this purpose: first transforming the data and then applying a classifier.
This should work for you:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
X = dataset.iloc[:, 2:4].values
y = dataset["pages"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
classifier = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=20, random_state=0))
classifier = classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
# X_test is a NumPy array here, so wrap it in a DataFrame before concatenating
pred = pd.concat([pd.DataFrame(X_test), pd.Series(y_pred, name="pages")], axis=1)
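Since the question also asks about JSON, here is a small sketch of one way to get there, assuming the features are kept as a DataFrame. The toy data and the column names feat_a and feat_b are hypothetical; only the "pages" target name comes from the question.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for `dataset`: two feature columns and a "pages" target.
X_arr, y_arr = make_classification(
    n_samples=100, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
features = pd.DataFrame(X_arr, columns=["feat_a", "feat_b"])
target = pd.Series(y_arr, name="pages")

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=0
)

model = make_pipeline(
    StandardScaler(), RandomForestClassifier(n_estimators=20, random_state=0)
)
model.fit(X_train, y_train)

# Adding the predictions as a column keeps them aligned with X_test's rows.
pred = X_test.assign(pages=model.predict(X_test))

# One JSON object per row, e.g. [{"feat_a": ..., "feat_b": ..., "pages": ...}, ...]
json_text = pred.to_json(orient="records")
print(json_text[:200])
```

To write a file instead of a string, pass a path as the first argument to to_json.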

How to generate the confusion matrix through cross-validation in Python?

I am using the iris flower dataset for classification. I need to make a confusion matrix through cross-validation (10 folds) but I don't know how to do it. I generated the confusion matrix for only one round.
# I am using TPOT autoML library for python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator
from sklearn.preprocessing import LabelEncoder
tpot_data = pd.read_csv('iris.csv')
tpot_data = tpot_data.apply(LabelEncoder().fit_transform)
features = tpot_data.drop('species', axis=1).values
training_features, testing_features, training_target, testing_target = \
train_test_split(features, tpot_data['species'].values, random_state=10)
exported_pipeline = make_pipeline(StackingEstimator(estimator=GaussianNB()),
MultinomialNB(alpha=0.01, fit_prior=False)
)
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(testing_target, results))
pd.crosstab(testing_target, results, rownames=['Actual Class'], colnames=['Predicted Class'])
from sklearn.model_selection import cross_val_score
array_cross_val_score = cross_val_score(estimator=exported_pipeline, X=training_features,
y=training_target, cv=10, scoring='accuracy')
# I would like the confusion matrix to be based on the average cross-validation
np.mean(array_cross_val_score)
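One possible approach, sketched under some assumptions: collect the out-of-fold predictions with cross_val_predict and build a single confusion matrix from them. This pools the predictions from all 10 folds rather than averaging ten separate matrices. A plain GaussianNB and scikit-learn's bundled iris data are used here as stand-ins for the exported TPOT pipeline and iris.csv.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

# Bundled iris data (assumption: same content as the asker's iris.csv).
X_iris, y_iris = load_iris(return_X_y=True)

# Each sample is predicted by a model that did not see it during training.
y_cv_pred = cross_val_predict(GaussianNB(), X_iris, y_iris, cv=10)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_iris, y_cv_pred)
print(cm)
```

Any estimator or pipeline with fit and predict could be passed in place of GaussianNB().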

How do cross_val_score and GridSearchCV work?

I am new to Python and I have been trying to figure out how GridSearchCV and cross_val_score work.
After getting odd results, I set up a sort of validation experiment, but I still do not understand what I am doing wrong.
To simplify, I am using GridSearchCV in the simplest possible way to try to validate and understand what is happening.
Here it is:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, QuantileTransformer
from sklearn.feature_selection import SelectKBest, f_regression, RFECV
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeCV,Ridge, LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import GridSearchCV,KFold,TimeSeriesSplit,PredefinedSplit,cross_val_score
from sklearn.metrics import mean_squared_error,make_scorer,r2_score,mean_absolute_error,mean_squared_error
from math import sqrt
I create a cross-validation object (for GridSearchCV and cross_val_score) and a train/test dataset for the pipeline and the simple linear regression. I have checked that the two datasets are identical:
train_indices = np.full((15,), -1, dtype=int)
test_indices = np.full((6,), 0, dtype=int)
test_fold = np.append(train_indices, test_indices)
kf = PredefinedSplit(test_fold)
for train_index, test_index in kf.split(X):
    print('TRAIN:', train_index, 'TEST:', test_index)
    X_train_kf = X[train_index]
    X_test_kf = X[test_index]
train_data = list(range(0,15))
test_data = list(range(15,21))
X_train, y_train=X[train_data,:],y[train_data]
X_test, y_test=X[test_data,:],y[test_data]
Here is what I do:
Instantiate a simple linear model and use it with the manually split data:
lr=LinearRegression()
lm=lr.fit(X,y)
lmscore_train=lm.score(X_train,y_train)
->r2=0.4686662249071524
lmscore_test=lm.score(X_test,y_test)
->r2 0.6264021467338086
Now I try to do the exact same thing using a pipeline:
pipe_steps = ([('est', LinearRegression())])
pipe=Pipeline(pipe_steps)
p=pipe.fit(X,y)
pscore_train=p.score(X_train,y_train)
->r2=0.4686662249071524
pscore_test=p.score(X_test,y_test)
->r2 0.6264021467338086
LinearRegression and the pipeline match perfectly.
Now I try to do the same using cross_val_score with the predefined split kf:
cv_scores = cross_val_score(lm, X, y, cv=kf)
->r2 = -1.234474757883921470e+01?!?! (this is supposed to be the test score)
Now let's try GridSearchCV:
scoring = {'r_squared':'r2'}
grid_parameters = [{}]
gridsearch=GridSearchCV(p, grid_parameters, verbose=3,cv=kf,scoring=scoring,return_train_score=True,refit='r_squared')
gs=gridsearch.fit(X,y)
results=gs.cv_results_
From cv_results_ I once again get:
->mean_test_r_squared->r2->-1.234474757883921292e+01
So cross_val_score and GridSearchCV match one another in the end, but the score is totally off and different from what it should be.
Could you please help me solve this puzzle?
cross_val_score and GridSearchCV first split the data, train the model on the training data only, and then score it on the test data.
Here you are training on the full data and then scoring on the test data, so you don't match the results of cross_val_score.
Instead of this:
lm=lr.fit(X,y)
Try this:
lm=lr.fit(X_train, y_train)
Same for pipeline:
Instead of p=pipe.fit(X,y), do this:
p=pipe.fit(X_train, y_train)
See my other answers for a more detailed description:
https://stackoverflow.com/a/42364900/3374996
https://stackoverflow.com/a/42230764/3374996
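A quick way to check this explanation is the sketch below: with the same 15/6 predefined split, a model fit on the first 15 rows only should give the same test score as cross_val_score. The random data (X_demo, y_demo) is an assumption, since the original X and y are not shown.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import PredefinedSplit, cross_val_score

# Random data with 21 rows (assumption: stands in for the asker's X and y).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(21, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=21)

# Rows marked -1 are always in the training set; rows marked 0 form the test fold.
test_fold = np.append(np.full(15, -1), np.full(6, 0))
split = PredefinedSplit(test_fold)

# cross_val_score fits on the 15 training rows and scores on the 6 test rows.
cv_score = cross_val_score(LinearRegression(), X_demo, y_demo, cv=split)[0]

# The manual equivalent: fit on the training rows only, not on all of X.
manual_model = LinearRegression().fit(X_demo[:15], y_demo[:15])
manual_score = manual_model.score(X_demo[15:], y_demo[15:])

print(cv_score, manual_score)
```

Fitting on all 21 rows instead lets the model see the test rows, which is why the manually computed test score in the question looked so much better.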

Include feature extraction in pipeline sklearn

For a text classification project I made a pipeline for the feature selection and the classifier. Now my question is if it is possible to include the feature extraction module in the pipeline and how. I looked some things up about it, but it doesn't seem to fit with my current code.
This is what I have now:
# feature_extraction module.
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.feature_extraction import DictVectorizer
import numpy as np
vec = DictVectorizer()
X = vec.fit_transform(instances)
scaler = StandardScaler(with_mean=False) # we use cross validation, no train/test set
X_scaled = scaler.fit_transform(X) # To make sure everything is on the same scale
enc = LabelEncoder()
y = enc.fit_transform(labels)
# Feature selection and classification pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn import linear_model
from sklearn.pipeline import Pipeline
feat_sel = SelectKBest(mutual_info_classif, k=200)
clf = linear_model.LogisticRegression()
pipe = Pipeline([('mutual_info', feat_sel), ('logistregress', clf)])
y_pred = model_selection.cross_val_predict(pipe, X_scaled, y, cv=10)
How can I put everything from the DictVectorizer up to the LabelEncoder into the pipeline?
Here's how you would do it. Assuming instances is a dict-like object (or an iterable of them), as specified in the DictVectorizer API, build your pipeline like so:
pipe = Pipeline([('vectorizer', DictVectorizer()),
('scaler', StandardScaler(with_mean=False)),
('mutual_info', feat_sel),
('logistregress', clf)])
Then, to predict, call cross_val_predict, passing instances as X:
y_pred = model_selection.cross_val_predict(pipe, instances, y, cv=10)
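For reference, here is a self-contained sketch of the whole thing on toy data. The toy_instances and toy_labels values, k=2 and cv=5 are made up for illustration. The LabelEncoder stays outside the pipeline because pipeline steps only transform X, not y.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy feature dicts and labels (assumption: stand-ins for instances and labels).
toy_instances = [
    {"len": float(i % 5 + 1), "caps": float(i % 2), "word": "spam" if i % 2 else "ham"}
    for i in range(20)
]
toy_labels = ["spam" if i % 2 else "ham" for i in range(20)]

# Labels are encoded once, outside the pipeline.
y_toy = LabelEncoder().fit_transform(toy_labels)

pipe = Pipeline([
    ("vectorizer", DictVectorizer()),             # dicts -> sparse feature matrix
    ("scaler", StandardScaler(with_mean=False)),  # with_mean=False keeps it sparse
    ("mutual_info", SelectKBest(mutual_info_classif, k=2)),
    ("logistregress", LogisticRegression()),
])

# The raw list of dicts is passed as X; every step is refit inside each fold.
y_pred_toy = cross_val_predict(pipe, toy_instances, y_toy, cv=5)
print(y_pred_toy)
```

Depending on your scikit-learn version, mutual_info_classif may warn about continuous values here, because it treats the columns of a sparse matrix as discrete features.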
