Spark performing worse than scikit-learn for multiclass classification - python

Currently I have text data and I am trying to predict a class; in my case there are 60 classes to choose from. When I train a random forest model with scikit-learn, I get an F1 score of 78%.
However, when I set up the model in PySpark I only get 30%, which is way too low. What is going on? Maybe I am not setting it up correctly. Also, with PySpark the random forest is only ever able to predict about 12 of the labels, whereas in my case I have 60.
scikit-learn code:
rf_model = Pipeline([
    ('featextract', FeatureExtractor()),
    ('union', FeatureUnion(
        transformer_list=[
            # pipeline for tfidf
            ('text', Pipeline([
                ('selector', ItemSelector(key='TEXT')),
                ('count_vec', TfidfVectorizer(max_features=5000)),
                ('tfidf', TfidfTransformer())])),
            # pipeline for ATA
            ('ata', Pipeline([
                ('selector', ItemSelector(key="ATA_SYS_NO")),
                ('atas', convert2dict()),
                ('vect', DictVectorizer())]))
        ])),
    ('model', OneVsRestClassifier(RandomForestClassifier(n_estimators=200, n_jobs=5))),
])
PySpark code:
Tokenizer1 = Tokenizer(inputCol="TEXT", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=4000)
idf = IDF(inputCol="rawFeatures", outputCol="tfidffeatures")
rf = RF(labelCol="componentIndex", featuresCol='tfidffeatures', numTrees=500)
pipeline = Pipeline(stages=[Tokenizer1, hashingTF, idf, labelIndexer, rf])
(trainingData, testData) = df.randomSplit([0.8, 0.2])
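The question does not show how labelIndexer is defined. For reference, a minimal sketch of how it and an F1 evaluation might look in PySpark is below; the "component" input column name and the evaluator setup are assumptions, not the asker's code:

from pyspark.ml.feature import StringIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Assumed definition (created before the Pipeline above is built):
# map the 60 string labels to numeric indices.
labelIndexer = StringIndexer(inputCol="component", outputCol="componentIndex")

model = pipeline.fit(trainingData)
predictions = model.transform(testData)

# Spark's multiclass evaluator, asked here for the F1 metric.
evaluator = MulticlassClassificationEvaluator(
    labelCol="componentIndex", predictionCol="prediction", metricName="f1")
print(evaluator.evaluate(predictions))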

Related

Get feature names and coefficients from sklearn pipeline

I have a pipeline that uses mlflow in Databricks, and I would like to get the feature names and coefficients after I run the pipeline. My pipeline looks like this:
sklr_classifier = LogisticRegression(
    C=97.24899142002924,
    penalty="l2",
    random_state=956273824,
)

model = Pipeline([
    ("column_selector", col_selector),
    ("preprocessor", preprocessor),
    ("classifier", sklr_classifier),
])
pipe = model.fit(X_train, y_train)
I know I can access the coefficients with:
pipe.named_steps["classifier"].coef_.flatten()
But I would like to have the associated feature names.
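No answer is included above, but one common approach, assuming the preprocessor step implements get_feature_names_out() (for example a ColumnTransformer in scikit-learn >= 1.0), is a sketch like this:

import pandas as pd

# Feature names as they leave the preprocessing step (assumes it exposes
# get_feature_names_out, e.g. a ColumnTransformer).
feature_names = pipe.named_steps["preprocessor"].get_feature_names_out()
coefs = pipe.named_steps["classifier"].coef_

# One row per feature, one column of coefficients per class.
coef_table = pd.DataFrame(coefs.T, index=feature_names)
print(coef_table)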

Combine two sklearn pipelines into one

I have a text preprocessing Pipeline:
pipe = Pipeline([
    ('count_vectorizer', CountVectorizer()),
    ('chi2score', SelectKBest(chi2, k=1000)),
    ('tfidf_transformer', TfidfTransformer(norm='l2', use_idf=True)),
])
and I want to perform cross-validation on this pipeline combined with several different estimators. The following solution works, but honestly I don't really like it; there should be a better way to do it. Maybe somehow convert the Pipeline into a transformer?
pipe_nb = Pipeline([*pipe.steps, ('naive_bayes', MultinomialNB())])
The approach below is the one I would consider ideal, but unfortunately it does not merge the steps into the new pipeline and causes issues.
pipe_nb = make_pipeline(
    pipe,
    MultinomialNB()
)
How can I merge two pipelines into one in a nice, Pythonic way?
Why not simply append a new step to pipe.steps rather than recreate a new pipeline?
pipe.steps.append(('naive_bayes', MultinomialNB()))
print(pipe)
# Output
Pipeline(steps=[('count_vectorizer', CountVectorizer()),
                ('chi2score', SelectKBest(k=1000, score_func=<function chi2 at ...>)),
                ('tfidf_transformer', TfidfTransformer()),
                ('naive_bayes', MultinomialNB())])
Or:
# Don't forget the *
pipe2 = make_pipeline(*pipe, MultinomialNB())
print(pipe2)
# Output
Pipeline(steps=[('countvectorizer', CountVectorizer()),
                ('selectkbest', SelectKBest(k=1000, score_func=<function chi2 at ...>)),
                ('tfidftransformer', TfidfTransformer()),
                ('multinomialnb', MultinomialNB())])
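If the underlying goal is to cross-validate the same preprocessing with several different estimators, one possible sketch (not from the answer above; X, y and the estimator list are placeholders, and clone() keeps the shared preprocessing steps independent between pipelines) is:

from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

estimators = {
    'naive_bayes': MultinomialNB(),
    'random_forest': RandomForestClassifier(),
}

for name, est in estimators.items():
    # clone() copies the unfitted preprocessing steps, so each pipeline
    # gets its own independent transformers.
    full_pipe = Pipeline([*clone(pipe).steps, (name, est)])
    scores = cross_val_score(full_pipe, X, y, cv=5)
    print(name, scores.mean())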

How to Save and Load Machine Learning (One-vs-Rest) Models (PYTHON)

My code below loops through each label (category) and creates a model for it. However, what I want is a general model that can accept new inputs from a user and make predictions.
I'm aware that the code below only saves the model that was fit for the last category in the loop. How can I fix this so that a model for each category is saved, and so that when I load those models I can predict a label for new text?
vectorizer = TfidfVectorizer(strip_accents='unicode',
                             stop_words=stop_words, analyzer='word',
                             ngram_range=(1,3), norm='l2')
vectorizer.fit(train_text)
vectorizer.fit(test_text)

x_train = vectorizer.transform(train_text)
y_train = train.drop(labels=['question_body'], axis=1)

x_test = vectorizer.transform(test_text)
y_test = test.drop(labels=['question_body'], axis=1)

# Using pipeline for applying linearSVC and one vs rest classifier
SVC_pipeline = Pipeline([
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])

for category in categories:
    print('... Processing {}'.format(category))
    # train the SVC model using X_dtm & y
    SVC_pipeline.fit(x_train, train[category])
    # compute the testing accuracy of SVC
    svc_prediction = SVC_pipeline.predict(x_test)
    print("SVC Prediction:")
    print(svc_prediction)
    print('Test accuracy is {}'.format(f1_score(test[category], svc_prediction)))
    print("\n")

# save the model to disk
filename = 'svc_model.sav'
pickle.dump(SVC_pipeline, open(filename, 'wb'))
There are multiple mistakes in your code.
You are fitting your TfidfVectorizer on both train and test:
vectorizer.fit(train_text)
vectorizer.fit(test_text)
This is wrong. Calling fit() is not incremental: it will not learn from both datasets when called twice, because the most recent call to fit() discards everything learned in previous calls. And you should never fit (learn) anything on test data.
What you need to do is this:
vectorizer.fit(train_text)
The pipeline does not work the way you think:
# Using pipeline for applying linearSVC and one vs rest classifier
SVC_pipeline = Pipeline([
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
Notice that you are passing LinearSVC inside OneVsRestClassifier, so it will be used automatically; the Pipeline is not doing anything here. A Pipeline is useful when you want to pass your data through multiple estimators sequentially. Something like this:
pipe = Pipeline([
    ('pca', pca),
    ('logistic', LogisticRegression())
])
The above pipe passes the data to PCA, which transforms it; the transformed data is then passed to LogisticRegression, and so on.
Correct usage of a pipeline in your case would be:
SVC_pipeline = Pipeline([
    ('vectorizer', vectorizer),
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
See more examples here:
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#examples-using-sklearn-pipeline-pipeline
You also need to describe more about your "categories" and show some examples of your data. You are not using y_train and y_test anywhere. Are the categories different from "question_body"?
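As for the original question of saving a model per category, which the answer above does not show, a minimal sketch, reusing the categories, training data and vectorizer from the question, would be to dump one fitted pipeline per category inside the loop:

import pickle

from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

for category in categories:
    # A fresh pipeline per category, so earlier fits are not overwritten;
    # the vectorizer is refit as part of each pipeline on the raw training text.
    category_pipeline = Pipeline([
        ('vectorizer', vectorizer),
        ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
    ])
    category_pipeline.fit(train_text, train[category])
    # One file per category, e.g. 'svc_model_<category>.sav'.
    with open('svc_model_{}.sav'.format(category), 'wb') as f:
        pickle.dump(category_pipeline, f)

# Later, load a category's model and predict on new raw text.
with open('svc_model_{}.sav'.format(categories[0]), 'rb') as f:
    loaded = pickle.load(f)
print(loaded.predict(["some new question text"]))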

Scikit Learn: Predict category based on a number of text-based features

Predict which users will find the review helpful (or blank if no one has found it useful so far). Either: 1) predict the string of users (assuming the order is always alphabetical); or 2) for each user, predict whether or not they will find the review useful. For now there are fewer than ten users, and code for that case is acceptable, but it is interesting to consider a future application that predicts many more users (say a few thousand or even millions of possible users).
Sample data: train.csv
"id","title","review","user tags","user(s) who find review helpful"
"123","All movies!","I really love movies","love,all","Bill"
"456","No movies!","I really hate movies","hate,none","Jane"
"789","Great show!","That show was really great","great,really","Bill,Jane,Wanda"
"899","Interesting plot!","He makes the plot interesting","interesting,plot",""
"999","So tired!","The ending made me sleep","ending,tired,sleepy",""
Task: use the text from columns 1, 2 and 3 to predict the text in column 4; ignore the numeric id in column 0.
So far I am using the guide here (http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html).
Current code:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier

'''text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=1e-3, random_state=42,
                          max_iter=5, tol=None)),
])
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])'''

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MLPClassifier(solver='lbfgs', alpha=1e-5,
                          hidden_layer_sizes=(50, 20),
                          random_state=1, shuffle=True, max_iter=200)),
])

data = pd.read_csv('./train.csv',
                   error_bad_lines=False, header=None, sep=',',
                   dtype={
                       0: np.dtype('u8'),  # id, 64-bit unsigned integer
                       2: np.dtype('U'),   # title, unicode
                       3: np.dtype('U'),   # review, unicode
                       4: np.dtype('U'),   # tags, unicode
                       5: np.dtype('U'),   # name(s), unicode
                   })

# TODO: Split user names column by comma.
xtr = data.iloc[0:100000, 1:5].astype(str).values
ytr = data.iloc[0:100000, 5].values
xtest = data.iloc[100001:101000, 1:5].values
ytest = data.iloc[100001:101000, 5]

text_clf.fit(xtr, ytr)
predicted = text_clf.predict(xtest)
print(np.mean(predicted == ytest))
Which produces the following output:
---> 38 predicted = text_clf.predict(data.iloc[100001:101000,5].values)
AttributeError: 'numpy.int64' object has no attribute 'lower'
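No answer is included here, but this error usually means CountVectorizer received something other than a 1-D iterable of strings; above, a 2-D array of several columns is passed to fit, and in the traceback the label column rather than the feature columns is passed to predict. One possible sketch of a fix, joining the text columns into a single string per row before vectorizing (column positions taken from the code above), is:

# Combine the text columns into one string per row, then fit on 1-D arrays.
text_cols = data.iloc[0:100000, 1:5].astype(str)
xtr = text_cols.apply(' '.join, axis=1).values        # 1-D array of strings
ytr = data.iloc[0:100000, 5].astype(str).values

text_cols_test = data.iloc[100001:101000, 1:5].astype(str)
xtest = text_cols_test.apply(' '.join, axis=1).values
ytest = data.iloc[100001:101000, 5].astype(str).values

text_clf.fit(xtr, ytr)
predicted = text_clf.predict(xtest)
print(np.mean(predicted == ytest))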

Merging bag-of-words scikits classifier with arbitrary numeric fields

How would you merge a scikit-learn classifier that operates over a bag of words with one that operates on arbitrary numeric fields?
I know that these are basically the same thing behind the scenes, but I'm having trouble figuring out how to do this with the existing library methods. For example, my bag-of-words classifier uses this pipeline:
classifier = Pipeline([
    ('vectorizer', HashingVectorizer(ngram_range=(1,4), non_negative=True)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])
classifier.fit(['some random text','some other text', ...], [CLS_A, CLS_B, ...])
Whereas my other usage is like:
classifier = LinearSVC()
classifier.fit([1.23, 4.23, ...], [CLS_A, CLS_B, ...])
How would I construct a LinearSVC classifier that could be trained using both sets of data simultaneously? e.g.
classifier = ?
classifier.fit([('some random text',1.23),('some other text',4.23), ...], [CLS_A, CLS_B, ...])
The easy way:
import scipy.sparse

tfidf = Pipeline([
    ('vectorizer', HashingVectorizer(ngram_range=(1,4), non_negative=True)),
    ('tfidf', TfidfTransformer()),
])
X_tfidf = tfidf.fit_transform(texts)
X_other = load_your_other_features()
X = scipy.sparse.hstack([X_tfidf, X_other])

clf = LinearSVC().fit(X, y)
The principled solution, which allows you to keep everything in one Pipeline, would be to wrap hashing, tf-idf and your other feature extraction method in a few simple transformer objects and put these in a FeatureUnion, but it's hard to tell what the code would look like from the information you've given.
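As a rough illustration of that FeatureUnion idea (a sketch only, with made-up ItemGetter and Reshape2D transformers and (text, number) tuples as input; this is not the answerer's code, and HashingVectorizer's non_negative option is omitted because newer scikit-learn versions no longer accept it):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

class ItemGetter(BaseEstimator, TransformerMixin):
    """Pick one element out of each (text, number) sample."""
    def __init__(self, index):
        self.index = index
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return [x[self.index] for x in X]

class Reshape2D(BaseEstimator, TransformerMixin):
    """Turn a list of scalars into the 2-D column estimators expect."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.array(X, dtype=float).reshape(-1, 1)

features = FeatureUnion([
    ('text', Pipeline([
        ('get_text', ItemGetter(0)),
        ('vectorizer', HashingVectorizer(ngram_range=(1, 4))),
        ('tfidf', TfidfTransformer()),
    ])),
    ('numeric', Pipeline([
        ('get_num', ItemGetter(1)),
        ('reshape', Reshape2D()),
    ])),
])

clf = Pipeline([('features', features), ('svc', LinearSVC())])
clf.fit([('some random text', 1.23), ('some other text', 4.23)], ['CLS_A', 'CLS_B'])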
(P.S. As I keep saying on SO, on the mailing list and elsewhere, OneVsRestClassifier(LinearSVC()) is useless. LinearSVC does OvR out of the box, so this is just a slower way of fitting an OvR SVM.)
