I'm trying to conduct both hyperparameter tuning and feature selection on a sklearn SVC model.
I tried the code below, but am getting an error, which I have included.
clf = Pipeline([('anova', SelectPercentile(f_classif)),
                ('svc', SVC(probability=True))])

score_means = list()
score_params = list()
percentiles = (1, 3, 6, 10, 15, 20, 30, 40, 60, 80, 100)

params = {
    "C": np.logspace(-3, 17, 21),
    "gamma": np.logspace(-20, 1, 21),
    'class_weight': [None, 'balanced']
}

halving_search = HalvingGridSearchCV(estimator=clf,
                                     param_grid=params,
                                     scoring='neg_brier_score',
                                     factor=2,
                                     verbose=2,
                                     cv=2)

for percentile in percentiles:
    clf.set_params(anova__percentile=percentile)
    this_scores = halving_search.fit(x_train, y_train)
    score_means.append(this_scores.best_score_)
    score_params.append(this_scores.best_params_)
Running the pipeline code with a cross_val_score separate from the HalvingGridSearchCV works, but I want to conduct both feature selection and hyperparameter tuning to find which combination of features and hyperparameters produces the best model.
When I run the above code, I get the following error:
Traceback (most recent call last):
File "<ipython-input-83-cf714445297c>", line 4, in <module>
this_scores = halving_search.fit(x_train, y_train)
File "C:\Users\fredd\Anaconda3\lib\site-packages\sklearn\model_selection\_search_successive_halving.py", line 213, in fit
super().fit(X, y=y, groups=None, **fit_params)
File "C:\Users\fredd\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "C:\Users\fredd\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py", line 841, in fit
self._run_search(evaluate_candidates)
File "C:\Users\fredd\Anaconda3\lib\site-packages\sklearn\model_selection\_search_successive_halving.py", line 320, in _run_search
more_results=more_results)
File "C:\Users\fredd\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py", line 809, in evaluate_candidates
enumerate(cv.split(X, y, groups))))
File "C:\Users\fredd\Anaconda3\lib\site-packages\joblib\parallel.py", line 1041, in __call__
if self.dispatch_one_batch(iterator):
File "C:\Users\fredd\Anaconda3\lib\site-packages\joblib\parallel.py", line 859, in dispatch_one_batch
self._dispatch(tasks)
File "C:\Users\fredd\Anaconda3\lib\site-packages\joblib\parallel.py", line 777, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "C:\Users\fredd\Anaconda3\lib\site-packages\joblib\_parallel_backends.py", line 208, in apply_async
result = ImmediateResult(func)
File "C:\Users\fredd\Anaconda3\lib\site-packages\joblib\_parallel_backends.py", line 572, in __init__
self.results = batch()
File "C:\Users\fredd\Anaconda3\lib\site-packages\joblib\parallel.py", line 263, in __call__
for func, args, kwargs in self.items]
File "C:\Users\fredd\Anaconda3\lib\site-packages\joblib\parallel.py", line 263, in <listcomp>
for func, args, kwargs in self.items]
File "C:\Users\fredd\Anaconda3\lib\site-packages\sklearn\utils\fixes.py", line 222, in __call__
return self.function(*args, **kwargs)
File "C:\Users\fredd\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 581, in _fit_and_score
estimator = estimator.set_params(**cloned_parameters)
File "C:\Users\fredd\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 150, in set_params
self._set_params('steps', **kwargs)
File "C:\Users\fredd\Anaconda3\lib\site-packages\sklearn\utils\metaestimators.py", line 54, in _set_params
super().set_params(**params)
File "C:\Users\fredd\Anaconda3\lib\site-packages\sklearn\base.py", line 233, in set_params
(key, self))
ValueError: Invalid parameter C for estimator Pipeline(steps=[('anova', SelectPercentile(percentile=1)),
('svc', SVC(probability=True))]). Check the list of available parameters with `estimator.get_params().keys()`.
It reads as if the halving search is trying to set C on the pipeline itself rather than on the SVC step.
You want to perform a grid search over a Pipeline object. When defining the parameters for the different steps of the pipeline, you have to use the <step>__<parameter> syntax:
params = {
    "svc__C": np.logspace(-3, 17, 21),
    "svc__gamma": np.logspace(-20, 1, 21),
    "svc__class_weight": [None, 'balanced']
}
See the user guide for more information.
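Since the goal is to find the best combination of features and hyperparameters, the percentile loop can also be folded into the same search. A minimal sketch, assuming the clf pipeline, imports, and x_train/y_train from the question (note the grid is large, so expect a long run):
params = {
    "anova__percentile": [1, 3, 6, 10, 15, 20, 30, 40, 60, 80, 100],
    "svc__C": np.logspace(-3, 17, 21),
    "svc__gamma": np.logspace(-20, 1, 21),
    "svc__class_weight": [None, 'balanced']
}

halving_search = HalvingGridSearchCV(estimator=clf,
                                     param_grid=params,
                                     scoring='neg_brier_score',
                                     factor=2,
                                     verbose=2,
                                     cv=2)

# One fit searches over feature percentiles and SVC hyperparameters jointly
halving_search.fit(x_train, y_train)
print(halving_search.best_params_, halving_search.best_score_)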
Related
I have created a multioutput RandomForestRegressor using sklearn.ensemble.RandomForestRegressor. I now want to perform a GridSearchCV to find good hyperparameters and output the r^2 scores for each individual target feature. The code I use looks as follows:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

param_grid = {
    'model__bootstrap': [True],
    'model__max_depth': [8, 10, 12],
    'model__max_features': [3, 4, 5],
    'model__min_samples_leaf': [3, 4, 5],
    'model__min_samples_split': [3, 5, 7],
    'model__n_estimators': [100, 200, 300]
}

model = RandomForestRegressor()
pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', model)])

scorer = make_scorer(r2_score, multioutput='raw_values')
search = GridSearchCV(pipe, param_grid, scoring=scorer)
search.fit(X_train, y_train)
print(f'Best parameter score {ship_type} {target}: {search.best_score_}')
When I run this code, I get the following error:
File "run_xgb_rf_regressor.py", line 75, in <module>
model, X = run_regression(ship_types[2], targets)
File "run_xgb_rf_regressor.py", line 50, in run_regression
search.fit(X_train, y_train)
File "/home/lucas/.local/lib/python3.8/site-packages/sklearn/utils/validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "/home/lucas/.local/lib/python3.8/site-packages/sklearn/model_selection/_search.py", line 841, in fit
self._run_search(evaluate_candidates)
File "/home/lucas/.local/lib/python3.8/site-packages/sklearn/model_selection/_search.py", line 1296, in _run_search
evaluate_candidates(ParameterGrid(self.param_grid))
File "/home/lucas/.local/lib/python3.8/site-packages/sklearn/model_selection/_search.py", line 795, in evaluate_candidates
out = parallel(delayed(_fit_and_score)(clone(base_estimator),
File "/home/lucas/.local/lib/python3.8/site-packages/joblib/parallel.py", line 1043, in __call__
if self.dispatch_one_batch(iterator):
File "/home/lucas/.local/lib/python3.8/site-packages/joblib/parallel.py", line 861, in dispatch_one_batch
self._dispatch(tasks)
File "/home/lucas/.local/lib/python3.8/site-packages/joblib/parallel.py", line 779, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/home/lucas/.local/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
result = ImmediateResult(func)
File "/home/lucas/.local/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 572, in __init__
self.results = batch()
File "/home/lucas/.local/lib/python3.8/site-packages/joblib/parallel.py", line 262, in __call__
return [func(*args, **kwargs)
File "/home/lucas/.local/lib/python3.8/site-packages/joblib/parallel.py", line 262, in <listcomp>
return [func(*args, **kwargs)
File "/home/lucas/.local/lib/python3.8/site-packages/sklearn/utils/fixes.py", line 222, in __call__
return self.function(*args, **kwargs)
File "/home/lucas/.local/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 625, in _fit_and_score
test_scores = _score(estimator, X_test, y_test, scorer, error_score)
File "/home/lucas/.local/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 721, in _score
raise ValueError(error_msg % (scores, type(scores), scorer))
ValueError: scoring must return a number, got [0.57359176 0.54407165 0.40313057 0.32515033 0.346224 0.39513717
0.34375699] (<class 'numpy.ndarray'>) instead. (scorer=make_scorer(r2_score, multioutput=raw_values))
Clearly the error suggests that I can only use a single numeric value, which in my case would be the average r^2 score over all target features. Does anybody know how I can use GridSearchCV so that I can output the individual r^2 scores?
Many thanks in advance.
I think I would use the following option for scoring parameter (from the docs):
a callable returning a dictionary where the keys are the metric names and the values are the metric scores;
So something like
def my_scorer(estimator, X, y):
    preds = estimator.predict(X)
    scores = r2_score(y, preds, multioutput='raw_values')
    return {f'r2_y{i}': score for i, score in enumerate(scores)}
Note, though, per the docs, that refit needs to be set more carefully with multimetric searches. Deciding the "best" parameters should probably be done by some average, in which case you can add another entry to the custom scorer, as sketched below.
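A minimal sketch of that idea, assuming a scikit-learn version (0.24+) that accepts a dict-returning callable for scoring; the mean entry is only an example of an aggregate to refit on:
def my_scorer(estimator, X, y):
    preds = estimator.predict(X)
    scores = r2_score(y, preds, multioutput='raw_values')
    metrics = {f'r2_y{i}': score for i, score in enumerate(scores)}
    metrics['r2_mean'] = scores.mean()  # aggregate used to pick the "best" candidate
    return metrics

# In a multimetric search, refit must name one of the returned keys
search = GridSearchCV(pipe, param_grid, scoring=my_scorer, refit='r2_mean')
search.fit(X_train, y_train)
print(search.cv_results_['mean_test_r2_y0'])  # per-target scores live in cv_results_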
Other useful parts of the User Guide:
https://scikit-learn.org/stable/modules/grid_search.html#multimetric-grid-search
https://scikit-learn.org/stable/modules/model_evaluation.html#implementing-your-own-scoring-object
I have a custom estimator that I implemented myself, and I am not able to use cross_val_score() with it, which I believe has something to do with my predict() method. Here is the full error trace:
Traceback (most recent call last):
File "/Users/joann/Desktop/Implementações ML/Adaboost Classifier/test.py", line 30, in <module>
ada2_score = cross_val_score(ada_2, X, y, cv=5)
File "/Users/joann/opt/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 390, in cross_val_score
error_score=error_score)
File "/Users/joann/opt/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 236, in cross_validate
for train, test in cv.split(X, y, groups))
File "/Users/joann/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 1004, in __call__
if self.dispatch_one_batch(iterator):
File "/Users/joann/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 835, in dispatch_one_batch
self._dispatch(tasks)
File "/Users/joann/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 754, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/Users/joann/opt/anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 209, in apply_async
result = ImmediateResult(func)
File "/Users/joann/opt/anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 590, in __init__
self.results = batch()
File "/Users/joann/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 256, in __call__
for func, args, kwargs in self.items]
File "/Users/joann/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 256, in <listcomp>
for func, args, kwargs in self.items]
File "/Users/joann/opt/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 544, in _fit_and_score
test_scores = _score(estimator, X_test, y_test, scorer)
File "/Users/joann/opt/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 591, in _score
scores = scorer(estimator, X_test, y_test)
File "/Users/joann/opt/anaconda3/lib/python3.7/site-packages/sklearn/metrics/_scorer.py", line 89, in __call__
score = scorer(estimator, *args, **kwargs)
File "/Users/joann/opt/anaconda3/lib/python3.7/site-packages/sklearn/metrics/_scorer.py", line 371, in _passthrough_scorer
return estimator.score(*args, **kwargs)
File "/Users/joann/Desktop/Implementações ML/Adaboost Classifier/Adaboost.py", line 92, in score
scr_pred = self.predict(X)
File "/Users/joann/Desktop/Implementações ML/Adaboost Classifier/Adaboost.py", line 73, in predict
clf_pred = clf.predict(X)
File "/Users/joann/opt/anaconda3/lib/python3.7/site-packages/sklearn_extensions/extreme_learning_machines/elm.py", line 614, in predict
class_predictions = self.binarizer.inverse_transform(raw_predictions)
File "/Users/joann/opt/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/_label.py", line 528, in inverse_transform
self.classes_, threshold)
File "/Users/joann/opt/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/_label.py", line 750, in _inverse_binarize_thresholding
format(y.shape))
ValueError: output_type='binary', but y.shape = (30, 3)
My predict(self, X) method returns a vector of size n_samples with the predictions for the X parameter. I also made a score() function as follows:
def score(self, X, y):
    scr_pred = self.predict(X)
    return sum(scr_pred == y) / X.shape[0]
This method simply computes the accuracy of the model on the given samples. Whether I use this score() method or set cross_val_score(..., scoring="accuracy"), it does not work.
Note: I am aware of this question/answer, but it doesn't apply to my case because I can confirm the consistency of my constructor:
def __init__(self, estimators=["MLP"], n_rounds=5, random_state=10):
    self.estimators = estimators
    self.n_rounds = n_rounds
    self.random_state = random_state
UPDATE:
Further research led me to this topic, where it is explained that sklearn can't deep-copy estimators with transformers. However, it is mandatory for my estimator to run LabelBinarizer to transform the data to get predictions, so I have updated the question title to reflect the actual issue.
Your problem statement is not entirely clear, but judging from the error you are attempting multiclass classification.
The problem is that at some point the preprocessing has not been done correctly: the error is raised from _inverse_binarize_thresholding because of the following check in sklearn's preprocessing code:
def _inverse_binarize_thresholding(y, output_type, classes, threshold):
    if output_type == "binary" and y.ndim == 2 and y.shape[1] > 2:
        raise ValueError("output_type='binary', but y.shape = {0}".
                         format(y.shape))
There must be a missing transformation or preprocessing step in your code; you have to use LabelBinarizer correctly.
Go through the documentation below and backtrack the error to fix your code:
documentation
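As a hedged illustration of the check above (a toy example, not the asker's code): a binarizer fit on three classes round-trips an (n, 3) indicator matrix fine, while one fit on binary labels raises exactly this error when handed the same matrix:
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer().fit(["a", "b", "c"])    # y_type_ == 'multiclass'
Y = lb.transform(["a", "b", "c", "a"])        # shape (4, 3) indicator matrix
print(lb.inverse_transform(Y))                # works: ['a' 'b' 'c' 'a']

lb_bin = LabelBinarizer().fit(["yes", "no"])  # y_type_ == 'binary'
# lb_bin.inverse_transform(Y)  # raises ValueError: output_type='binary', but y.shape = (4, 3)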
I'm a bit puzzled, since I can't get the last example from the "Machine Learning with Python and H2O" manual (page 36) to work.
Here's the code:
import h2o
h2o.init()

from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.transforms.preprocessing import H2OScaler
from h2o.cross_validation import H2OKFold
from h2o.model.regression import h2o_r2_score
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics.scorer import make_scorer

h2o.__PROGRESS_BAR__ = False
h2o.no_progress()

iris_data_path = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris.csv"  # load demonstration data
iris_df = h2o.import_file(path=iris_data_path)

params = {"standardize__center": [True, False],
          "standardize__scale": [True, False],
          "gbm__ntrees": [10, 20],
          "gbm__max_depth": [1, 2, 3],
          "gbm__learn_rate": [0.1, 0.2]}

custom_cv = H2OKFold(iris_df, n_folds=5, seed=42)
pipeline = Pipeline([("standardize", H2OScaler()),
                     ("gbm", H2OGradientBoostingEstimator(distribution="gaussian"))])

random_search = RandomizedSearchCV(pipeline, params, n_iter=5, scoring=make_scorer(h2o_r2_score),
                                   cv=custom_cv, random_state=42, n_jobs=1)
random_search.fit(iris_df[1:], iris_df[0])
It returns the error ValueError: No valid specification of the columns. Only a scalar, list or slice of all integers or all strings, or boolean mask is allowed.
The full terminal message:
Traceback (most recent call last):
File "untitled-Copy1.py", line 34, in <module>
random_search.fit(iris_df[1:], iris_df[0])
File "/department/jupyter-dev/anaconda3/envs/python36/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 710, in fit
self._run_search(evaluate_candidates)
File "/department/jupyter-dev/anaconda3/envs/python36/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 1484, in _run_search
random_state=self.random_state))
File "/department/jupyter-dev/anaconda3/envs/python36/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 689, in evaluate_candidates
cv.split(X, y, groups)))
File "/department/jupyter-dev/anaconda3/envs/python36/lib/python3.6/site-packages/joblib/parallel.py", line 1004, in __call__
if self.dispatch_one_batch(iterator):
File "/department/jupyter-dev/anaconda3/envs/python36/lib/python3.6/site-packages/joblib/parallel.py", line 835, in dispatch_one_batch
self._dispatch(tasks)
File "/department/jupyter-dev/anaconda3/envs/python36/lib/python3.6/site-packages/joblib/parallel.py", line 754, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/department/jupyter-dev/anaconda3/envs/python36/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 209, in apply_async
result = ImmediateResult(func)
File "/department/jupyter-dev/anaconda3/envs/python36/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 590, in __init__
self.results = batch()
File "/department/jupyter-dev/anaconda3/envs/python36/lib/python3.6/site-packages/joblib/parallel.py", line 256, in __call__
for func, args, kwargs in self.items]
File "/department/jupyter-dev/anaconda3/envs/python36/lib/python3.6/site-packages/joblib/parallel.py", line 256, in <listcomp>
for func, args, kwargs in self.items]
File "/department/jupyter-dev/anaconda3/envs/python36/lib/python3.6/site-packages/sklearn/model_selection/_validation.py", line 508, in _fit_and_score
X_train, y_train = _safe_split(estimator, X, y, train)
File "/department/jupyter-dev/anaconda3/envs/python36/lib/python3.6/site-packages/sklearn/utils/metaestimators.py", line 201, in _safe_split
X_subset = _safe_indexing(X, indices)
File "/department/jupyter-dev/anaconda3/envs/python36/lib/python3.6/site-packages/sklearn/utils/__init__.py", line 390, in _safe_indexing
indices_dtype = _determine_key_type(indices)
File "/department/jupyter-dev/anaconda3/envs/python36/lib/python3.6/site-packages/sklearn/utils/__init__.py", line 288, in _determine_key_type
raise ValueError(err_msg)
ValueError: No valid specification of the columns. Only a scalar, list or slice of all integers or all strings, or boolean mask is allowed
Closing connection _sid_b8c1 at exit
H2O session _sid_b8c1 closed.
I'm using python 3.6.10 with sklearn 0.22.1 and h2o 3.28.0.3.
What am I doing wrong? Any help appreciated!
Have a great day :)
I have a school project that requires me to use machine learning. After several rounds of troubleshooting I have hit a dead end and don't know how to solve it.
I have this code:
db_connection = 'mysql+pymysql://root:#localhost/databases'
conn = create_engine(db_connection)
df = pd.read_sql("SELECT * from barang", conn)
cth_data = pd.DataFrame(df)
#print(cth_data.head())
cth_data = cth_data.dropna()

y = cth_data['kode_aset']
x = cth_data[['merk', 'ukuran', 'bahan', 'harga']]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

clf = RandomForestClassifier(n_estimators=100)
vectorizer = CountVectorizer(max_features=50000, ngram_range=(1, 50))

d_feture = vectorizer.fit_transform(x_train)
#d_label = vectorizer.transform(y_train)
clf.fit(d_feture, y_train)

t_data = vectorizer.transform(x_test)
y_pred = clf.predict(t_data)
print("Model_Accuracy: " + str(np.mean(y_pred == y_test)))
I fetched the data from a MySQL database; here is the database:
Screenshot of database
I ended up with this kind of error:
File "Machine_learn_V_0.0.1.py", line 41, in <module>
clf.fit(d_feture, y_train)
File "C:\Python35\lib\site-packages\sklearn\ensemble\forest.py", line 333, in fit
for i, t in enumerate(trees))
File "C:\Python35\lib\site-packages\sklearn\externals\joblib\parallel.py", line 917, in __call__
if self.dispatch_one_batch(iterator):
File "C:\Python35\lib\site-packages\sklearn\externals\joblib\parallel.py", line 759, in dispatch_one_batch
self._dispatch(tasks)
File "C:\Python35\lib\site-packages\sklearn\externals\joblib\parallel.py", line 716, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "C:\Python35\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 182, in apply_async
result = ImmediateResult(func)
File "C:\Python35\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 549, in __init__
self.results = batch()
File "C:\Python35\lib\site-packages\sklearn\externals\joblib\parallel.py", line 225, in __call__
for func, args, kwargs in self.items]
File "C:\Python35\lib\site-packages\sklearn\externals\joblib\parallel.py", line 225, in <listcomp>
for func, args, kwargs in self.items]
File "C:\Python35\lib\site-packages\sklearn\ensemble\forest.py", line 119, in _parallel_build_trees
tree.fit(X, y, sample_weight=curr_sample_weight, check_input=False)
File "C:\Python35\lib\site-packages\sklearn\tree\tree.py", line 801, in fit
X_idx_sorted=X_idx_sorted)
File "C:\Python35\lib\site-packages\sklearn\tree\tree.py", line 236, in fit
"number of samples=%d" % (len(y), n_samples))
ValueError: Number of labels=223 does not match number of samples=4
CountVectorizer takes strings; it cannot process multiple columns the way you hoped. (Iterating over a DataFrame yields its column names, so the vectorizer saw only the four column names as documents, which is why the error reports number of samples=4.) You should therefore concatenate the strings from cth_data[['merk','ukuran','bahan','harga']] into a single column, e.g.:
cols = ['merk','ukuran','bahan','harga']
cth_data['combined'] = cth_data[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
x = cth_data["combined"]
From there on, your code should work.
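A minimal sketch illustrating the failure mode, using a hypothetical toy frame rather than the asker's data:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

toy = pd.DataFrame({'merk': ['a'], 'ukuran': ['b'], 'bahan': ['c'], 'harga': ['d']})
print(list(iter(toy)))  # ['merk', 'ukuran', 'bahan', 'harga'] -- the 4 "documents" the vectorizer saw

# After combining into one string column, each row is one document:
toy['combined'] = toy.apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
X = CountVectorizer().fit_transform(toy['combined'])
print(X.shape)  # (1, n_features): one document per row, as intended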
I am trying to classify a set of text documents using multiple sets of features. I am using sklearn's Feature Union to combine different features for fitting into a single model. One of the features includes word embeddings using gensim's word2vec.
import numpy as np
from gensim.models.word2vec import Word2Vec
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
data = fetch_20newsgroups(subset='train', categories=categories)#dummy dataset
w2v_model = Word2Vec(data.data, size=100, window=5, min_count=5, workers=2)
word2vec={w: vec for w, vec in zip(w2v_model.wv.index2word, w2v_model.wv.syn0)} #dictionary of word embeddings
feat_select = SelectKBest(score_func=chi2, k=10) #other features
TSVD = TruncatedSVD(n_components=50, algorithm = "randomized", n_iter = 5)
#other features
In order to include transformers/estimators not already available in sklearn, I am attempting to wrap my word2vec results into a custom transformer class that returns the vector averages.
class w2vTransformer(TransformerMixin):
    """
    Wrapper class for running word2vec into pipelines and FeatureUnions
    """
    def __init__(self, word2vec, **kwargs):
        self.word2vec = word2vec
        self.kwargs = kwargs
        # dimension of one embedding vector (all vectors share the same length)
        self.dim = len(next(iter(word2vec.values())))

    def fit(self, x, y=None):
        return self

    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])
However when it comes time to fit the model I receive an error.
combined_features = FeatureUnion([("w2v_class", w2vTransformer(word2vec)),
                                  ("feat", feat_select),
                                  ("TSVD", TSVD)])  # join features into combined_features
#combined_features = FeatureUnion([("feat", feat_select), ("TSVD", TSVD)])  # runs when word embeddings are not included

text_clf_svm = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('feature_selection', combined_features),
                         ('clf-svm', SGDClassifier(loss="modified_huber")),
                         ])

text_clf_svm_1 = text_clf_svm.fit(data.data, data.target)  # fits data
Traceback (most recent call last):
File "<ipython-input-8-a085b7d40f8f>", line 1, in <module>
text_clf_svm_1 = text_clf_svm.fit(data.data,data.target) # fits data
File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 248, in fit
Xt, fit_params = self._fit(X, y, **fit_params)
File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 213, in _fit
**fit_params_steps[name])
File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\memory.py", line 362, in __call__
return self.func(*args, **kwargs)
File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 581, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 739, in fit_transform
for name, trans, weight in self._iter())
File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 779, in __call__
while self.dispatch_one_batch(iterator):
File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 625, in dispatch_one_batch
self._dispatch(tasks)
File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 588, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 111, in apply_async
result = ImmediateResult(func)
File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 332, in __init__
self.results = batch()
File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in <listcomp>
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 581, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "C:\Users\rlusk\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\base.py", line 520, in fit_transform
return self.fit(X, y, **fit_params).transform(X)
File "<ipython-input-6-cbc52cd420cd>", line 16, in transform
for words in X
File "<ipython-input-6-cbc52cd420cd>", line 16, in <listcomp>
for words in X
File "<ipython-input-6-cbc52cd420cd>", line 14, in <listcomp>
np.mean([self.word2vec[w] for w in words if w in self.word2vec]
TypeError: unhashable type: 'csr_matrix'
I understand that the error is because the variable "words" is a csr_matrix, but it needs to be an iterable such as a list. My question is how do I modify the transformer class or data so I can use the word embeddings as features to feed into FeatureUnion? This is my first SO post, please be gentle.
Instead of your custom transformer, you can avoid the bug by using the scikit-learn wrapper API provided directly by Gensim: https://radimrehurek.com/gensim/sklearn_api/w2vmodel.html
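A minimal sketch of that wrapper, assuming gensim 3.x (where gensim.sklearn_api still exists; it was removed in gensim 4) and its small built-in test corpus:
from gensim.sklearn_api import W2VTransformer
from gensim.test.utils import common_texts  # tiny tokenized demo corpus

# Train a small model and look up vectors for individual words
model = W2VTransformer(size=10, min_count=1, seed=1)
wordvecs = model.fit(common_texts).transform(['graph', 'system'])
print(wordvecs.shape)  # (2, 10): one 10-dimensional vector per word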
Also, depending on your version of Gensim, you may be able to solve the same bug by using the wv attribute of your word2vec object instead of indexing the object itself.
In the transform method of your w2vTransformer class:
self.word2vec.wv[w]
instead of
self.word2vec[w]
Hope it helps!