I'm trying to use FeatureUnion to extract different features from a data structure, but it fails due to different dimensions: ValueError: blocks[0,:] has incompatible row dimensions
Implementation
My FeatureUnion is built the following way:
features = FeatureUnion([
    ('f1', Pipeline([
        ('get', GetItemTransformer('f1')),
        ('transform', vectorizer_f1)
    ])),
    ('f2', Pipeline([
        ('get', GetItemTransformer('f2')),
        ('transform', vectorizer_f2)
    ]))
])
GetItemTransformer is used to get different parts of the data out of the same structure. The idea is described here in the scikit-learn issue tracker.
The structure itself is stored as {'f1': data_f1, 'f2': data_f2}, where data_f1 and data_f2 are lists of different lengths.
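For reference, GetItemTransformer follows the item-selector pattern from that issue; roughly (a simplified sketch, not the exact implementation):

from sklearn.base import BaseEstimator, TransformerMixin

class GetItemTransformer(BaseEstimator, TransformerMixin):
    """Selects one field from a dict-like structure such as {'f1': ..., 'f2': ...}."""
    def __init__(self, field):
        self.field = field

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.field]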
Question
Since the y vector differs in length from the data fields, I assume that is where the error comes from, but how can I scale the vector to fit both cases?
Here's what worked for me:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ArrayCaster(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, data):
        # cast the 1-D input to a 2-D column so FeatureUnion can stack it
        print(data.shape)
        print(np.transpose(np.matrix(data)).shape)
        return np.transpose(np.matrix(data))
FeatureUnion([
    ('text', Pipeline([
        ('selector', ItemSelector(key='text')),
        ('vect', CountVectorizer(ngram_range=(1, 1), binary=True, min_df=3)),
        ('tfidf', TfidfTransformer())
    ])),
    ('other data', Pipeline([
        ('selector', ItemSelector(key='has_foriegn_char')),
        ('caster', ArrayCaster())
    ]))
])
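On its own, all the caster does is turn a 1-D column into an (n_samples, 1) matrix, for example:

import numpy as np

caster = ArrayCaster()
caster.fit_transform(np.array([0, 1, 1]))  # prints (3,) then (3, 1), returns a 3x1 matrix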
I don't know if this applies to your question, but we ran into the same error in a slightly different situation and just solved it.
Our f1 entries were each a list of 15 numeric values, and we needed to do tf-idf on f2. This generated the same error about incompatible row dimensions.
After running it through the debugger, we found that the shapes of our matrices were subtly different going into the hstack() call in FeatureUnion: (2659,) and (2659, 706).
If we cast f1 to a 2D numpy array, the shape changed to (2659, 15) and the hstack call worked.
The cast was something like this: f1 = np.array(list(f1)).
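If it helps, a minimal stand-in for that situation (made-up data, same shapes, assuming the column came out of a DataFrame) looks like this:

import numpy as np
import pandas as pd

# hypothetical stand-in for our f1 column: one 15-value list per sample
f1 = pd.Series([[0.0] * 15 for _ in range(2659)])

print(np.asarray(f1).shape)      # (2659,)    -- 1-D object array, breaks hstack()
print(np.array(list(f1)).shape)  # (2659, 15) -- 2-D, so FeatureUnion can stack it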
Related
I want to classify some sentences with sklearn. The sentences are stored in a Pandas DataFrame.
To begin, I want to use the length of the sentence and its TF-IDF vector as features, so I created this pipeline:
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('meta', Pipeline([
            ('length', LengthAnalyzer())
        ])),
        ('bag-of-words', Pipeline([
            ('tfidf', TfidfVectorizer())
        ]))
    ])),
    ('model', LogisticRegression())
])
where the LengthAnalyzer is a custom TransformerMixin with:
def transform(self, documents):
    for document in documents:
        yield len(document)
So, LengthAnalyzer returns a single number per document (1 dimension) while TfidfVectorizer returns an n-dimensional list.
When I try to run this, I get
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 494, expected 1.
What has to be done to make this feature combination work?
It seems the problem originates from the yield used in transform(). Probably because of the yield, the number of rows reported to scipy's hstack method is 1 instead of the actual number of samples in documents.
There should be 494 rows (samples) in your data, which comes through correctly from TfidfVectorizer, but LengthAnalyzer only reports a single row. Hence the error.
If you can change it to
return np.array([len(document) for document in documents]).reshape(-1,1)
then the pipeline fits successfully.
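Put into the transformer itself, a working LengthAnalyzer could look roughly like this (a sketch based on what you posted):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class LengthAnalyzer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, documents):
        # return an (n_samples, 1) array so FeatureUnion can hstack it with the tf-idf matrix
        return np.array([len(document) for document in documents]).reshape(-1, 1)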
Note:
I tried to find a related issue on the scikit-learn GitHub but was unsuccessful. You could post the issue there to get some real feedback on this usage.
I want to debug my ML model and I want to see which features are used/activated in each observation in my test set. So I need to get the transformed matrix with the features after the vectorization, feature selection steps in the pipeline.
First of all here is the pipeline:
pipeline = Pipeline([
    ('fu', FeatureUnion(
        transformer_list=[
            ('val', Pipeline([
                ('ext', FeatureExtractor(feat_type="valence")),
                ('vect', DictVectorizer()),
            ])),
            ('bot', Pipeline([
                ('ext', FeatureExtractor(feat_type="bot", term="word", pos=False, negation=True)),
                ('vec', CountVectorizer(min_df=3, max_df=0.9, lowercase=False)),
                ("fs", SelectKBest(chi2, k=8000)),
                ('bin', Binarizer()),
                ('trans', TfidfTransformer(sublinear_tf=True, smooth_idf=True, use_idf=True)),
            ]))
        ],
    )),
    ('stats', FeatureStats()),
    ("fs", MaxAbsScaler()),
    ('classifier', svm.LinearSVC(C=.5)),
])
So I read the docs, and it seems that I can use transform to get the transformed matrix (it is now deprecated, but anyway...). And I tried doing this:
pipeline.fit(X_train, y_train)
transformed = pipeline.transform(X_test)
y_predicted = pipeline.predict(X_test)
But I have a problem.
As you can see, I select the 8000 most discriminating features, so the matrix that goes into the classifier should be N x 8000. The problem is that transformed is M x 3012 (N = training examples, M = test examples).
In order to verify that the pipeline is working correctly I put FeatureStats, a transformer that simply prints the length of the feature vectors, right before the classifier, and it prints 8000.
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureStats(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def transform(self, X, y=None):
        # report how many feature columns pass through at this point in the pipeline
        print("size = ", X.shape[1])
        return X

    def fit(self, X, y=None):
        return self
Why does pipeline.transform(X_test) return a smaller matrix? Am I missing something?
I have a set of custom features and a set of features created with Vectorizers, in this case TfidfVectorizer.
All of my custom features are simple np.arrays (e.g. [0, 5, 4, 22, 1]). I am using StandardScaler to scale all of my features, as you can see in my Pipeline, by calling StandardScaler after my "custom pipeline". The question is whether there is a way, or a need, to scale the output of the vectorizers I use in my "vectorized_pipeline". Applying StandardScaler to the vectorizers doesn't seem to work (I get the following error: "ValueError: Cannot center sparse matrices").
Another question: is it more sensible to scale all of my features after I have joined them in the FeatureUnion, or to scale each of them separately (in my example, by calling the scaler in "pos_cluster" and "stylistic_features" separately instead of after both of them have been joined)? Which is the better practice?
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn import feature_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']
inner_scaler = StandardScaler()
# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C = 0.10000000000000001)
# vectorizers
countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features=2000, analyzer=u'word', sublinear_tf=True, use_idf = True, min_df=2, max_df=0.85, lowercase = True)
countVecWord_tags = TfidfVectorizer(ngram_range=(1, 4), max_features= 1000, analyzer=u'word', min_df=2, max_df=0.85, sublinear_tf=True, use_idf = True, lowercase = False)
pipeline = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('vectorized_pipeline', Pipeline([
                ('union_vectorizer', FeatureUnion([
                    ('stem_text', Pipeline([
                        ('selector', ItemSelector(key='stem_text')),
                        ('stem_tfidf', countVecWord)
                    ])),
                    ('pos_text', Pipeline([
                        ('selector', ItemSelector(key='pos_text')),
                        ('pos_tfidf', countVecWord_tags)
                    ])),
                ])),
            ])),
            ('custom_pipeline', Pipeline([
                ('custom_features', FeatureUnion([
                    ('pos_cluster', Pipeline([
                        ('selector', ItemSelector(key='pos_text')),
                        ('pos_cluster_inner', pos_cluster)
                    ])),
                    ('stylistic_features', Pipeline([
                        ('selector', ItemSelector(key='raw_text')),
                        ('stylistic_features_inner', stylistic_features)
                    ]))
                ])),
                ('inner_scale', inner_scaler)
            ])),
        ],
        # weight components in FeatureUnion
        # n_jobs=6,
        transformer_weights={
            'vectorized_pipeline': 0.8,  # 0.8,
            'custom_pipeline': 1.0       # 1.0
        },
    )),
    ('clf', LinearSVC1),
])

pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)
First things first:
Error "Cannot center sparse matrices"
The reason is quite simple: StandardScaler applies a feature-wise transformation:
f_i = (f_i - mean(f_i)) / std(f_i)
which for sparse matrices results in dense ones, as mean(f_i) will usually be non-zero. In practice, only features exactly equal to their means end up being zero. scikit-learn does not want to do this, as it is a huge modification of your data that might cause failures in other parts of the code, huge memory usage, etc. How to deal with it? If you really want to do it, there are two options:
densify your matrix through .toarray(), which will require lots of memory, but will give you exactly what you expect
create StandardScaler without the mean, i.e. StandardScaler(with_mean=False), which will instead apply f_i = f_i / std(f_i) but keep your data in sparse format.
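For instance, something along these lines (a small sketch):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

X_sparse = TfidfVectorizer().fit_transform(['I am a sentence', 'an example'])

# StandardScaler().fit_transform(X_sparse) would raise "Cannot center sparse matrices"
X_scaled = StandardScaler(with_mean=False).fit_transform(X_sparse)  # stays sparse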
Is scaling needed?
This is a whole other problem. Usually, scaling (of any form) is just a heuristic. It is not something you have to apply, and there is no guarantee it will help; it is simply a reasonable thing to do when you have no idea what your data looks like. "Smart" vectorizers such as tf-idf are actually already doing that: the idf transformation is supposed to provide a kind of reasonable data scaling. There is no guarantee which one will be better, but in general tf-idf should be enough, especially given that it still supports sparse computations, while StandardScaler does not.
Each sample in my (iid) dataset looks like this:
x = [a_1,a_2...a_N,b_1,b_2...b_M]
I also have the label of each sample (This is supervised learning)
The a features are very sparse (namely a bag-of-words representation), while the b features are dense (integers; there are ~45 of them).
I am using scikit-learn, and I want to use GridSearchCV with pipeline.
The question: is it possible to use one CountVectorizer on the type-a features and another CountVectorizer on the type-b features?
What I want can be thought of as:
pipeline = Pipeline([
    ('vect1', CountVectorizer()),  # will work only on features [0, (N-1)]
    ('vect2', CountVectorizer()),  # will work only on features [N, (N+M-1)]
    ('clf', SGDClassifier()),      # will use all features to classify
])
parameters = {
    'vect1__max_df': (0.5, 0.75, 1.0),       # type a features only
    'vect1__ngram_range': ((1, 1), (1, 2)),  # type a features only
    'vect2__max_df': (0.5, 0.75, 1.0),       # type b features only
    'vect2__ngram_range': ((1, 1), (1, 2)),  # type b features only
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    'clf__n_iter': (10, 50, 80),
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
grid_search.fit(X, y)
Is that possible?
A nice idea was presented by @Andreas Mueller.
However, I want to keep the original non-chosen features as well... therefore, I cannot tell the column indices for each phase of the pipeline upfront (before the pipeline begins).
For example, if I set CountVectorizer(max_df=0.75), it may drop some terms, and the original column indices will change.
Thanks
Unfortunately, this is currently not as nice as it could be. You need to use FeatureUnion to concatenate the two kinds of features, and the transformer in each branch needs to select its features and transform them.
One way to do that is to make a pipeline of a transformer that selects the columns (you need to write that yourself) and the CountVectorizer. There is an example that does something similar here. That example actually separates the features as different values in a dictionary, but you don't need to do that.
Also have a look at the related issue for selecting columns which contains code for the transformer that you need.
It would look something like this with the current code:
make_pipeline(
    make_union(
        make_pipeline(FeatureSelector(some_columns), CountVectorizer()),
        make_pipeline(FeatureSelector(other_columns), CountVectorizer())),
    SGDClassifier())
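The FeatureSelector here is the part you write yourself; a minimal version might look like the following (a sketch that assumes your samples are rows of a 2-D array; the exact indexing depends on how your data is stored):

from sklearn.base import BaseEstimator, TransformerMixin

class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # keep only the requested columns; adapt the indexing to your data structure
        return X[:, self.columns]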
How would you merge a scikits-learn classifier that operates over a bag-of-words with one that operates on arbitrary numeric fields?
I know that these are basically the same thing behind the scenes, but I'm having trouble figuring out how to do this via the existing library methods. For example, my bag-of-words classifier uses the pipeline:
classifier = Pipeline([
    ('vectorizer', HashingVectorizer(ngram_range=(1, 4), non_negative=True)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])
classifier.fit(['some random text','some other text', ...], [CLS_A, CLS_B, ...])
Whereas my other usage is like:
classifier = LinearSVC()
classifier.fit([1.23, 4.23, ...], [CLS_A, CLS_B, ...])
How would I construct a LinearSVC classifier that could be trained using both sets of data simultaneously? E.g.
classifier = ?
classifier.fit([('some random text',1.23),('some other text',4.23), ...], [CLS_A, CLS_B, ...])
The easy way:
import scipy.sparse
tfidf = Pipeline([
    ('vectorizer', HashingVectorizer(ngram_range=(1, 4), non_negative=True)),
    ('tfidf', TfidfTransformer()),
])
X_tfidf = tfidf.fit_transform(texts)
X_other = load_your_other_features()
X = scipy.sparse.hstack([X_tfidf, X_other])
clf = LinearSVC().fit(X, y)
The principled solution, which allows you to keep everything in one Pipeline, would be to wrap hashing, tf-idf and your other feature extraction method in a few simple transformer objects and put these in a FeatureUnion, but it's hard to tell what the code would look like from the information you've given.
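For the general shape of it, assuming each sample is a (text, number) tuple as in your last fit call, a sketch might be (the FieldSelector and ToColumn transformers are placeholders you would adapt to your real data layout):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

class FieldSelector(BaseEstimator, TransformerMixin):
    """Pulls one field out of (text, number) sample tuples."""
    def __init__(self, index):
        self.index = index

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [x[self.index] for x in X]

class ToColumn(BaseEstimator, TransformerMixin):
    """Turns a list of numbers into an (n_samples, 1) array."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float).reshape(-1, 1)

classifier = Pipeline([
    ('features', FeatureUnion([
        ('text', Pipeline([
            ('select', FieldSelector(0)),
            ('vectorizer', HashingVectorizer(ngram_range=(1, 4))),
            ('tfidf', TfidfTransformer()),
        ])),
        ('numeric', Pipeline([
            ('select', FieldSelector(1)),
            ('column', ToColumn()),
        ])),
    ])),
    ('clf', LinearSVC()),
])

classifier.fit([('some random text', 1.23), ('some other text', 4.23)], ['CLS_A', 'CLS_B'])

In practice you would replace the two selectors with whatever matches how your samples are actually stored.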
(P.S. As I keep saying on SO, on the mailing list and elsewhere, OneVsRestClassifier(LinearSVC()) is useless. LinearSVC does OvR out of the box, so this is just a slower way of fitting an OvR SVM.)