Merging bag-of-words scikits classifier with arbitrary numeric fields - python

How would you merge a scikits-learn classifier that operates over a bag-of-words with one that operates on arbitrary numeric fields?
I know that these are basically the same thing behind-the-scenes, but I'm having trouble figuring out how to do this via the existing library methods. For example, my bag-of-words classifier uses the pipeline:
classifier = Pipeline([
('vectorizer', HashingVectorizer(ngram_range=(1,4), non_negative=True)),
('tfidf', TfidfTransformer()),
('clf', OneVsRestClassifier(LinearSVC())),
])
classifier.fit(['some random text','some other text', ...], [CLS_A, CLS_B, ...])
Whereas my other usage is like:
classifier = LinearSVC()
classifier.fit([1.23, 4.23, ...], [CLS_A, CLS_B, ...])
How would I construct a LinearSVC classifier that could be trained using both sets of data simeltaneously? e.g.
classifier = ?
classifier.fit([('some random text',1.23),('some other text',4.23), ...], [CLS_A, CLS_B, ...])

The easy way:
import scipy.sparse
tfidf = Pipeline([
('vectorizer', HashingVectorizer(ngram_range=(1,4), non_negative=True)),
('tfidf', TfidfTransformer()),
])
X_tfidf = tfidf.fit_transform(texts)
X_other = load_your_other_features()
X = scipy.sparse.hstack([X_tfidf, X_other])
clf = LinearSVC().fit(X, y)
The principled solution, which allows you to keep everything in one Pipeline, would be to wrap hashing, tf-idf and your other feature extraction method in a few simple transformer objects and put these in a FeatureUnion, but it's hard to tell what the code would look like from the information you've given.
(P.S. As I keep saying on SO, on the mailing list and elsewhere, OneVsRestClassifier(LinearSVC()) is useless. LinearSVC does OvR out of the box, so this is just a slower way of fitting an OvR SVM.)

Related

Combine two sklearn pipelines into one

I have a text preprocessing Pipeline:
pipe = Pipeline([
('count_vectorizer', CountVectorizer()),
('chi2score', SelectKBest(chi2, k=1000)),
('tfidf_transformer', TfidfTransformer(norm='l2', use_idf=True)),
])
and I want to perform cross validation on a pipeline with multiple estimators. This is a solution that is working, but honestly I don't really like it. There should be a better way to do it. Maybe somehow convert the Pipeline to a transformer?
pipe_nb = Pipeline([*pipe.steps, ('naive_bayes', MultinomialNB())])
That's an approach that I perceive as an ideal one, but unfortunately it does not merge steps into new pipeline and causes issues.
pipe_nb = make_pipeline(
pipe,
MultinomialNB()
)
How to merge two pipelines into one, in a nice pythonic way?
Why do not simply append a new step into pipe.steps rather than to recreate a new one?
pipe.steps.append(('naive_bayes', MultinomialNB()))
print(pipe)
# Output
Pipeline(steps=[('count_vectorizer', CountVectorizer()),
('chi2score', SelectKBest(k=1000, score_func=1)),
('tfidf_transformer', TfidfTransformer()),
('naive_bayes', MultinomialNB())])
Or:
# Don't forget the *
pipe2 = make_pipeline(*pipe, MultinomialNB())
print(pipe2)
# Output
Pipeline(steps=[('countvectorizer', CountVectorizer()),
('selectkbest', SelectKBest(k=1000, score_func=1)),
('tfidftransformer', TfidfTransformer()),
('multinomialnb', MultinomialNB())])

How to Save and Load Machine Learning (One-vs-Rest) Models (PYTHON)

I have here my code wherein it loops through each label or category then creates a model out of it. However, what I want is to create a general model that will be able to accept new predictions that are inputs from a user.
I'm aware that the code below saves the model that is fit for the last category in the loop. How can I fix this so that models for each category will be saved so that when I load those models, i would be able to predict a label for a new text?
vectorizer = TfidfVectorizer(strip_accents='unicode',
stop_words=stop_words, analyzer='word', ngram_range=(1,3), norm='l2')
vectorizer.fit(train_text)
vectorizer.fit(test_text)
x_train = vectorizer.transform(train_text)
y_train = train.drop(labels = ['question_body'], axis=1)
x_test = vectorizer.transform(test_text)
y_test = test.drop(labels = ['question_body'], axis=1)
# Using pipeline for applying linearSVC and one vs rest classifier
SVC_pipeline = Pipeline([
('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
for category in categories:
print('... Processing {}'.format(category))
# train the SVC model using X_dtm & y
SVC_pipeline.fit(x_train, train[category])
# compute the testing accuracy of SVC
svc_prediction = SVC_pipeline.predict(x_test)
print("SVC Prediction:")
print(svc_prediction)
print('Test accuracy is {}'.format(f1_score(test[category], svc_prediction)))
print("\n")
#save the model to disk
filename = 'svc_model.sav'
pickle.dump(SVC_pipeline, open(filename, 'wb'))
There are multiple mistakes in your code.
You are fitting your TfidfVectorizer on both train and test:
vectorizer.fit(train_text)
vectorizer.fit(test_text)
This is wrong. Calling fit() is not incremental. It will not learn on both data if called two times. The most recent call to fit() will forget everything from past calls. You never fit (learn) something on test data.
What you need to do is this:
vectorizer.fit(train_text)
The pipeline does not work the way you think:
# Using pipeline for applying linearSVC and one vs rest classifier
SVC_pipeline = Pipeline([
('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
See that you are passing LinearSVC inside the OneVsRestClassifier, so it will automatically use that without the need of Pipeline. Pipeline will not do anything here. Pipeline is of use when you sequentially want to pass your data through multiple models. Something like this:
pipe = Pipeline([
('pca', pca),
('logistic', LogisticRegression())
])
What the above pipe will do is pass the data to PCA which will transform it. Then that new data is passed to LogisticRegression and so on..
Correct usage of pipeline in your case can be:
SVC_pipeline = Pipeline([
('vectorizer', vectorizer)
('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
See more examples here:
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#examples-using-sklearn-pipeline-pipeline
You need to describe more about your "categories". Show some examples of your data. You are not using y_train and y_test anywhere. Is the categories different from "question_body"?

spark performing worse than scikit-learn for MultiClass

Currently I have text data and I am trying to predict a class. In my case I have 60 classes to choose from. When I deploy the model in random forest using scikit-learn, I get an f1 score of 78%.
However, I try to setup the model in pyspark and only get 30%. WAY TOO LOW! What is going on? Maybe I am not setting it up right. Also, with pyspark, random forest only is able to predict up to 12 labels where in my case I have 60.
Sci-kit learn code:
rf_model = Pipeline([
('featextract',FeatureExtractor()),
('union', FeatureUnion(
transformer_list=[
# pipeline for tfidf
('text', Pipeline([
('selector',ItemSelector(key='TEXT')),
('count_vec',TfidfVectorizer(max_features=5000)),
('tfidf', TfidfTransformer())])),
# pipeline for ata
('ata', Pipeline([
('selector', ItemSelector(key="ATA_SYS_NO")),
('atas',convert2dict()),
('vect',DictVectorizer())]))
])),
('model', OneVsRestClassifier(RandomForestClassifier(n_estimators=200,n_jobs=5))),
])
pySpark code
Tokenizer1 = Tokenizer(inputCol="TEXT",outputCol="words")
hashingTF = HashingTF(inputCol="words",outputCol="rawFeatures",numFeatures=4000)
idf = IDF(inputCol="rawFeatures",outputCol="tfidffeatures")
rf = RF(labelCol="componentIndex",featuresCol='tfidffeatures',numTrees=500)
pipeline = Pipeline(stages=[Tokenizer1,hashingTF,idf,labelIndexer,rf])
(trainingData,testData) = df.randomSplit([0.8,0.2])

sklearn pipeline - how to apply different transformations on different columns

I am pretty new to pipelines in sklearn and I am running into this problem: I have a dataset that has a mixture of text and numbers i.e. certain columns have text only and rest have integers (or floating point numbers).
I was wondering if it was possible to build a pipeline where I can for example call LabelEncoder() on the text features and MinMaxScaler() on the numbers columns. The examples I have seen on the web mostly point towards using LabelEncoder() on the entire dataset and not on select columns. Is this possible? If so any pointers would be greatly appreciated.
The way I usually do it is with a FeatureUnion, using a FunctionTransformer to pull out the relevant columns.
Important notes:
You have to define your functions with def since annoyingly you can't use lambda or partial in FunctionTransformer if you want to pickle your model
You need to initialize FunctionTransformer with validate=False
Something like this:
from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import FunctionTransformer
def get_text_cols(df):
return df[['name', 'fruit']]
def get_num_cols(df):
return df[['height','age']]
vec = make_union(*[
make_pipeline(FunctionTransformer(get_text_cols, validate=False), LabelEncoder()))),
make_pipeline(FunctionTransformer(get_num_cols, validate=False), MinMaxScaler())))
])
Since v0.20, you can use ColumnTransformer to accomplish this.
An Example of ColumnTransformer might help you:
# FOREGOING TRANSFORMATIONS ON 'data' ...
# filter data
data = data[data['county'].isin(COUNTIES_OF_INTEREST)]
# define the feature encoding of the data
impute_and_one_hot_encode = Pipeline([
('impute', SimpleImputer(strategy='most_frequent')),
('encode', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])
featurisation = ColumnTransformer(transformers=[
("impute_and_one_hot_encode", impute_and_one_hot_encode, ['smoker', 'county', 'race']),
('word2vec', MyW2VTransformer(min_count=2), ['last_name']),
('numeric', StandardScaler(), ['num_children', 'income'])
])
# define the training pipeline for the model
neural_net = KerasClassifier(build_fn=create_model, epochs=10, batch_size=1, verbose=0, input_dim=109)
pipeline = Pipeline([
('features', featurisation),
('learner', neural_net)])
# train-test split
train_data, test_data = train_test_split(data, random_state=0)
# model training
model = pipeline.fit(train_data, train_data['label'])
You can find the entire code under: https://github.com/stefan-grafberger/mlinspect/blob/19ca0d6ae8672249891835190c9e2d9d3c14f28f/example_pipelines/healthcare/healthcare.py

Do you need to scale Vectorizers in sklearn?

I have a set of custom features and a set of features created with Vectorizers, in this case TfidfVectorizer.
All of my custom features are simple np.arrays (e.g. [0, 5, 4, 22, 1]). I am using StandardScaler to scale all of my featues, as you can see in my Pipeline by calling StandardScaler after my "custom pipeline". The question is whether there is a way or a need to scale the Vectorizers I use in my "vectorized_pipeline". Applying StandardScaler on the vectorizers doesn't seem to work (I get the following Error: "ValueError: Cannot center sparse matrices").
And another question, is it sensible to scale all of my features after I have joined them in the FeatureUnion or do I scale each of them separately (in my example, by calling the scaler in "pos_cluster" and "stylistic_features" seprately instead of calling it after the both of them have been joined), what is a better practice of doing this?
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn import feature_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']
inner_scaler = StandardScaler()
# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C = 0.10000000000000001)
# vectorizers
countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features=2000, analyzer=u'word', sublinear_tf=True, use_idf = True, min_df=2, max_df=0.85, lowercase = True)
countVecWord_tags = TfidfVectorizer(ngram_range=(1, 4), max_features= 1000, analyzer=u'word', min_df=2, max_df=0.85, sublinear_tf=True, use_idf = True, lowercase = False)
pipeline = Pipeline([
('union', FeatureUnion(
transformer_list=[
('vectorized_pipeline', Pipeline([
('union_vectorizer', FeatureUnion([
('stem_text', Pipeline([
('selector', ItemSelector(key='stem_text')),
('stem_tfidf', countVecWord)
])),
('pos_text', Pipeline([
('selector', ItemSelector(key='pos_text')),
('pos_tfidf', countVecWord_tags)
])),
])),
])),
('custom_pipeline', Pipeline([
('custom_features', FeatureUnion([
('pos_cluster', Pipeline([
('selector', ItemSelector(key='pos_text')),
('pos_cluster_inner', pos_cluster)
])),
('stylistic_features', Pipeline([
('selector', ItemSelector(key='raw_text')),
('stylistic_features_inner', stylistic_features)
]))
])),
('inner_scale', inner_scaler)
])),
],
# weight components in FeatureUnion
# n_jobs=6,
transformer_weights={
'vectorized_pipeline': 0.8, # 0.8,
'custom_pipeline': 1.0 # 1.0
},
)),
('clf', classifier),
])
pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)
First things first:
Error "Cannot center sparse matrices"
The reason is quite simple - StandardScaler efficiently applies feature-wise transformation:
f_i = (f_i - mean(f_i)) / std(f_i)
which for sparse matrices will result in the dense ones, as mean(f_i) will be non zero (usually). In practise only features equal to their means - will end up being zero. Scikit learn does not want to do this, as this is a huge modification of your data, which might result in failures in other parts of code, huge usage of memory etc. How to deal with it? If you really want to do this, there are two options:
densify your matrix through .toarray(), which will require lots of memory, but will give you exactly what you expect
create StandardScaler without mean, thus StandardScaler(with_mean = False) which instaed willl apply f_i = f_i / std(f_i), but will leave sparse format of your data.
Is scalind needed?
This is a whole other problem - usualy, scaling (of any form) is just a heuristics. This is not something that you have to apply, there are no guarantees that it will help, it is just a reasonable thing to do when you have no idea what your data looks like. "Smart" vectorizers, such as tfidf are actually already doing that. The idf transformation is supposed to create a kind of reasonable data scaling. There is no guarantee which one will be better, but in general, tfidf should be enough. Especially given the fact, that it still support sparse computations, while StandardScaler does not.

Categories