Include feature extraction in pipeline sklearn - python

For a text classification project I made a pipeline for the feature selection and the classifier. Now my question is if it is possible to include the feature extraction module in the pipeline and how. I looked some things up about it, but it doesn't seem to fit with my current code.
This is what I have now:
# feature_extraction module.
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.feature_extraction import DictVectorizer
import numpy as np
vec = DictVectorizer()
X = vec.fit_transform(instances)
scaler = StandardScaler(with_mean=False) # we use cross validation, no train/test set
X_scaled = scaler.fit_transform(X) # To make sure everything is on the same scale
enc = LabelEncoder()
y = enc.fit_transform(labels)
# Feature selection and classification pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn import linear_model
from sklearn.pipeline import Pipeline
feat_sel = SelectKBest(mutual_info_classif, k=200)
clf = linear_model.LogisticRegression()
pipe = Pipeline([('mutual_info', feat_sel), ('logistregress', clf)])
y_pred = model_selection.cross_val_predict(pipe, X_scaled, y, cv=10)
How can I move the steps from the DictVectorizer through the LabelEncoder into the pipeline?

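Here's how you would do it. Assuming instances is a list of dict-like objects (feature name to value mappings), as the DictVectorizer API expects, just build your pipeline like so. The LabelEncoder stays outside, because a Pipeline only transforms X, not y: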
pipe = Pipeline([('vectorizer', DictVectorizer()),
('scaler', StandardScaler(with_mean=False)),
('mutual_info', feat_sel),
('logistregress', clf)])
Then, to predict, call cross_val_predict, passing the raw instances as X:
y_pred = model_selection.cross_val_predict(pipe, instances, y, cv=10)
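For anyone who wants to try this end to end, here is a minimal sketch of the same pipeline on toy data. The instances and labels below are made up for illustration (they are not from the question), and k is set to 3 only because the toy vocabulary is tiny.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy data: one dict of token counts per document.
instances = [
    {"good": 2, "fun": 1},
    {"good": 1, "great": 1},
    {"fun": 2, "great": 1},
    {"good": 1, "fun": 1, "great": 1},
    {"bad": 2, "dull": 1},
    {"bad": 1, "awful": 1},
    {"dull": 2, "awful": 1},
    {"bad": 1, "dull": 1, "awful": 1},
]
labels = ["pos"] * 4 + ["neg"] * 4

# The label encoder works on y, so it is applied outside the pipeline.
y = LabelEncoder().fit_transform(labels)

pipe = Pipeline([
    ("vectorizer", DictVectorizer()),
    ("scaler", StandardScaler(with_mean=False)),
    ("mutual_info", SelectKBest(mutual_info_classif, k=3)),
    ("logistregress", LogisticRegression()),
])

# Every step, including the vectorizer, is refit on each training fold.
y_pred = cross_val_predict(pipe, instances, y, cv=2)
print(y_pred)
```

Because the raw dicts are passed in, the vectorizer, scaler and selector only ever see the training part of each fold.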

Related

How do you use GridSearchCV to search values for alpha in lasso and ridge regression models?

I'm doing an exercise on using sklearn Pipelines and GridSearchCV to find values for alpha in lasso and ridge regression models, where we also use SimpleImputer to take care of some missing values.
However, I keep getting the following traceback error:
ValueError: Invalid parameter alpha for estimator Pipeline
However, I'm certain that Sklearn's Ridge and Lasso both have the alpha parameter and have double checked on the sklearn docs.
Where am I going wrong? My code is below:
import numpy as np
import pandas as pd
from sklearn import svm, datasets
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn import metrics
columns = ["sex","length","diam","height","whole","shucked","viscera","shell","age"]
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data",names=columns)
y = df.age
X=df.drop(columns=['age'])
num_cols = X.select_dtypes(include=np.number).columns
cat_cols = X.select_dtypes(include=['object']).columns
#create some missing values
for i in range(1000):
X.loc[np.random.choice(X.index),np.random.choice(X.columns)] = np.nan
x_train, x_test, y_train, y_test = train_test_split(X,y, random_state=0, test_size=0.25)
cat_vals = Pipeline([("imputer",SimpleImputer(strategy='most_frequent')), ("ohe",OneHotEncoder(sparse=False, drop='first'))])
num_vals = Pipeline([("imputer",SimpleImputer(strategy='mean')), ("scale",StandardScaler())])
preprocess = ColumnTransformer(
transformers=[
("cat_process", cat_vals, cat_cols),
("num_process", num_vals, num_cols)
]
)
#Create a pipeline with preprocess and a linear regression model
pipeline = Pipeline([("preprocess",preprocess),
("regr",LinearRegression())])
#simple parameter grid, with and without the intercept
param_grid = {
'regr': [LinearRegression()],
"regr__fit_intercept": [True,False]
}
# 1. Update the dictionary to include values for alpha in lasso and ridge regression models. Use np.logspace(-4,2,10).
param_grid['alpha'] = np.logspace(-4,2,10)
param_grid['regr'] = [Lasso(), Ridge()]
# 2. Fit the GridSearchCV on the training data and print the best estimator and score from the search.
gscv = GridSearchCV(pipeline, param_grid, scoring='neg_mean_squared_error', cv=5)
gscv.fit(x_train, y_train)
print(gscv.best_params_)
print(gscv.best_score_)
print(gscv.best_estimator_)
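The parameter name must be prefixed with the name of the pipeline step followed by a double underscore, i.e. regr__alpha, just as you did for fit_intercept. A bare alpha is looked up on the Pipeline itself, which has no such parameter, hence the error.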
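As a rough sketch of what the corrected grid could look like: the example below uses synthetic data from make_regression and a plain StandardScaler step instead of the abalone preprocessing, so those parts are assumptions for the sake of a self-contained example. Splitting the grid into a list of two dicts is one way to keep LinearRegression (which has no alpha) in the search alongside Lasso and Ridge.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real training data.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

pipeline = Pipeline([("scale", StandardScaler()), ("regr", LinearRegression())])

param_grid = [
    # LinearRegression has no alpha, so it gets its own sub-grid.
    {"regr": [LinearRegression()], "regr__fit_intercept": [True, False]},
    # Parameters of the "regr" step are addressed as regr__<name>.
    {
        "regr": [Lasso(), Ridge()],
        "regr__fit_intercept": [True, False],
        "regr__alpha": np.logspace(-4, 2, 10),
    },
]

gscv = GridSearchCV(pipeline, param_grid, scoring="neg_mean_squared_error", cv=5)
gscv.fit(X, y)
print(gscv.best_params_)
print(gscv.best_score_)
```

If you only want Lasso and Ridge, a single dict with regr, regr__fit_intercept and regr__alpha is enough.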

How to feed data into random forest classifier and see prediction

I have built a random forest classifier using scikit-learn and Python, and I am having trouble actually feeding data in to see the prediction. I want to see the format of the output and convert it to a JSON file. I have attached the code for the random forest and what the data looks like. I believe I need to use 'y_pred', but I am not sure what format the input data needs to be.
X = dataset.iloc[:, 2:4].values
y = dataset["pages"]
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=20, random_state = 0)
classifier = classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
You can simply concatenate the predicted values with the matrix of features.
Also note that a pipeline exists for exactly this purpose: first transforming the data and then applying a classifier.
This should work for you:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
X = dataset.iloc[:, 2:4].values
y = dataset["pages"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
classifier = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=20, random_state=0))
classifier = classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
# X_test is a NumPy array here, so wrap it in a DataFrame before concatenating
pred = pd.concat([pd.DataFrame(X_test), pd.Series(y_pred, name="pages")], axis=1)
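Since the question also asks about JSON, here is a small sketch of one way to go from predictions to a JSON string. The dataset below is invented (column names width and height are placeholders, not the asker's real columns), and it keeps X as a DataFrame rather than calling .values, which is a variation on the code above so that the column names survive.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the asker's dataset.
dataset = pd.DataFrame({
    "id": range(10),
    "name": list("abcdefghij"),
    "width": [1.0, 1.2, 0.9, 1.1, 1.3, 5.0, 5.2, 4.9, 5.1, 5.3],
    "height": [2.0, 2.1, 1.9, 2.2, 2.0, 7.0, 7.1, 6.9, 7.2, 7.0],
    "pages": ["short"] * 5 + ["long"] * 5,
})

X = dataset.iloc[:, 2:4]  # kept as a DataFrame so column names are preserved
y = dataset["pages"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

classifier = make_pipeline(
    StandardScaler(), RandomForestClassifier(n_estimators=20, random_state=0)
)
classifier.fit(X_train, y_train)

# Attach the predictions to the test features; the array is assigned by position.
pred = X_test.copy()
pred["pages"] = classifier.predict(X_test)

# One JSON object per row; pass a file path to to_json to write a file instead.
json_text = pred.to_json(orient="records")
print(json_text)
```

The input to predict just needs the same columns, in the same order, as the data the classifier was fitted on.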

how to generate the confusion matrix through cross validation in python?

I am using the iris flower dataset for classification. I need to build a confusion matrix from 10-fold cross-validation, but I don't know how to do it. So far I have only generated the confusion matrix for a single train/test split.
# I am using TPOT autoML library for python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator
from sklearn.preprocessing import LabelEncoder
tpot_data = pd.read_csv('iris.csv')
tpot_data = tpot_data.apply(LabelEncoder().fit_transform)
features = tpot_data.drop('species', axis=1).values
training_features, testing_features, training_target, testing_target = \
train_test_split(features, tpot_data['species'].values, random_state=10)
exported_pipeline = make_pipeline(StackingEstimator(estimator=GaussianNB()),
MultinomialNB(alpha=0.01, fit_prior=False)
)
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(testing_target, results))
pd.crosstab(testing_target, results, rownames=['Actual Class'], colnames=['Predicted Class'])
from sklearn.model_selection import cross_val_score
array_cross_val_score = cross_val_score(estimator=exported_pipeline, X=training_features,
y=training_target, cv=10, scoring='accuracy')
# I would like the confusion matrix to be based on the average cross-validation
np.mean(array_cross_val_score)
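One possible approach, sketched here under some assumptions: collect the out-of-fold predictions with cross_val_predict and build a single confusion matrix from them. The example uses scikit-learn's bundled iris data and a plain GaussianNB as a stand-in for the exported TPOT pipeline, so swap in your own estimator and features.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB()  # stand-in for exported_pipeline

# Each sample is predicted exactly once, by the model trained on the other 9 folds.
y_pred = cross_val_predict(model, X, y, cv=10)

# Confusion matrix pooled over all 10 folds (rows: actual, columns: predicted).
cm = confusion_matrix(y, y_pred)
print(cm)
print("Accuracy:", accuracy_score(y, y_pred))
```

Note that this matrix is the sum over the folds rather than an average; dividing it by the number of folds gives the mean per-fold matrix if that is what you need.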

Machine Learning Algorithm does not work after Vectorizing a feature that is of type text

I am trying to build a classifier, and my features are a combination of words, numbers and text. I am trying to vectorize the feature that is of type text, but when I run it through a classification algorithm it throws the following error.
line 51, in
classifier.fit(X_train, y_train.values.ravel())
ValueError: setting an array element with a sequence.
Below is my code.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from io import StringIO
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
df = pd.read_csv('data.csv')
df = df[pd.notnull(df['memo'])]
df = df[pd.notnull(df['name'])]
# factorize type, name, and categorized account
df['type_id'] = df.txn_type.factorize()[0]
df['name_id'] = df.name.factorize()[0]
df['categorizedAccountId'] = df.categorizedAccount.factorize()[0]
my_list = df['categorizedAccountId'].tolist()
print(my_list)
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
memoFeatures = tfidf.fit_transform(df.memo)
df['memo_id'] = pd.Series(memoFeatures, index=df.index)
X = df.loc[:, ['type_id', 'name_id', 'memo_id']]
y = df.loc[:, ['categorizedAccountId']]
X_train, X_test, y_train, y_test = train_test_split(X, y)
'''print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
'''
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train.values.ravel())
y_pred = classifier.predict(X_test)
confusion_matrix = confusion_matrix(y_test, y_pred)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(classifier.score(X_test, y_test)))
And also here are a few rows of my Data. The top row has the labels and the categorized account is the class
"txn_type","name","memo","account","amount","categorizedAccount"
"Journal","","ABC.com 11/29/16 Payments",0,207.24,"1072 ABC.com Money Out Clearing"
"Bill Payment","College Tuition Fund","Multiple inv. (details on stub)",164,-207.24,"1072 ABC.com Money Out Clearing"
Ok so I have implemented some modifications to your code, which I paste here. This snippet goes immediately after you read the csv, and drop the null rows. You have to implement the train_test_split yourself though.
df['categorizedAccount'] = df['categorizedAccount'].astype('category')
df['all_text'] = df['txn_type'] + ' ' + df['name'] + ' ' + df['memo']
X = df['all_text']
y = df['categorizedAccount']
X_train = X # Change these four lines for train_test_split
X_test = X # I don't have enough rows in the mock dataset to implement it,
y_train = y # And it returns an error
y_test = y
tfidf = TfidfVectorizer()
X_train_transformed = tfidf.fit_transform(X_train)
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train_transformed, y_train)
X_test_transformed = tfidf.transform(X_test)
y_pred = classifier.predict(X_test_transformed)
classifier.score(X_test_transformed, y_test)  # score against the true labels, not the predictions
A few comments though:
from sklearn.feature_extraction.text import TfidfVectorizer
Imported once, ok
from io import StringIO
Unnecessary as far as I can see
from sklearn.feature_extraction.text import TfidfVectorizer
Why do you import it again?
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
TfidfVectorizer does the job of both CountVectorizer and TfidfTransformer. From the sklearn docs: "Equivalent to CountVectorizer followed by TfidfTransformer."
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
Not used, do not import.
Additionally:
1) It is not clear what you are trying to do with factorize. TfidfVectorizer automatically performs tokenization for any string of text that you provide it. All columns that you have selected in your original code contain only strings, so it makes more sense to concatenate them and let tfidf do the tokenization, rather than trying to do it yourself.
2) Use the Pipeline constructor, it will save your life.
3) X = df.loc[:, ['type_id', 'name_id', 'memo_id']] This kind of slicing is needlessly verbose, just call df[['column_name_1','column_name_2','column_name_3']]
4) And remember PEP 20, "Simple is better than complex"!
As a last piece of advice, when developing an ML model it's always better to start with something plain and simple, and then build on it once you have something that works.
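To illustrate point 2, here is a minimal sketch of the same idea wrapped in a Pipeline. The texts and labels are invented placeholders in the spirit of the sample rows, not real data, and the step names tfidf and clf are arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical concatenated text fields (txn_type + name + memo) and their classes.
texts = [
    "Journal ABC.com payments", "Journal ABC.com money out",
    "Journal ABC.com clearing payments", "Journal payments ABC.com",
    "Journal ABC.com payments clearing", "Journal money out ABC.com",
    "Bill Payment College Tuition Fund", "Bill Payment tuition invoice",
    "Bill Payment College fund invoice", "Bill Payment tuition details on stub",
    "Bill Payment College Tuition invoice", "Bill Payment fund details on stub",
]
labels = ["clearing"] * 6 + ["tuition"] * 6

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The vectorizer is fitted on the training texts only and reused for the test texts.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(random_state=0)),
])
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(y_pred, accuracy)
```

With the pipeline there is no separate transform call to forget, and the whole thing can be passed straight to cross_val_score or GridSearchCV.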

Feature extraction per fold with Scikit

For a machine learning experiment I have to perform feature selection. I have no split into training and test sets because I use 10-fold cross-validation. Someone told me that I have to do the feature selection per fold, but I have no idea how to do that. Here is a part of my code.
vec = DictVectorizer()
X = vec.fit_transform(instances) # No train/ test set, because we'll use 10-fold cross validation
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X) # To make sure everything is on the same scale
enc = LabelEncoder()
y = enc.fit_transform(labels)
#feature selection
from sklearn.feature_selection import SelectKBest, mutual_info_classif
feat_sel = SelectKBest(mutual_info_classif, k=200)
X_fs = feat_sel.fit_transform(X_scaled, y)
#train a classifier
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
y_pred = model_selection.cross_val_predict(clf, X_fs, y, cv=10)
Can someone help me with the selection per fold?
Answering the second question that you posted.
You can cross-validate the selector and the classifier together and look at the results:
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFECV
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np
feat_sel = SelectKBest(mutual_info_classif, k=200)
clf = MultinomialNB()
pipe = Pipeline([('mutual_info',feat_sel), ('naive_bayes',clf)])
scores = cross_val_score(pipe, X_scaled, y, cv=10, scoring='accuracy')
print(np.mean(scores))
You can use Pipeline, join the feature selector and the classifier into a pipeline and cross-validate the pipeline.
Reference: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
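If you want to convince yourself that the selection really happens per fold, one option is cross_validate with return_estimator=True, which hands back the pipeline fitted on each training fold. The sketch below uses synthetic count data (an assumption, since the original instances are not available) and k=5 instead of 200 to fit the small feature count.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Synthetic non-negative count features; the first three depend on the class.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.poisson(1.0, size=(100, 20))
X[:, :3] += 3 * y[:, None]

pipe = Pipeline([
    ("mutual_info", SelectKBest(mutual_info_classif, k=5)),
    ("naive_bayes", MultinomialNB()),
])

# return_estimator=True keeps the pipeline fitted on each of the 10 training folds.
results = cross_validate(pipe, X, y, cv=10, scoring="accuracy", return_estimator=True)

# The selected feature indices can differ from fold to fold.
selected_per_fold = [
    est.named_steps["mutual_info"].get_support(indices=True)
    for est in results["estimator"]
]
for fold, selected in enumerate(selected_per_fold):
    print(fold, selected)
print(np.mean(results["test_score"]))
```

Each fold's selector is fitted only on that fold's training samples, which is what "feature selection per fold" amounts to.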
