Python Scikit-Learn: Custom Analyzer for TfidfVectorizer

So I am trying to understand how to write a custom analyzer for scikit-learn's TfidfVectorizer.
I am working on the following Kaggle competition:
https://www.kaggle.com/c/whats-cooking
As a first step, I do some cleanup on the ingredients column:
traindf = pd.read_json('../../data/train.json')
traindf['ingredients_string'] = [' '.join([WordNetLemmatizer().lemmatize(re.sub('[^A-Za-z]', ' ', line)) for line in lists]).strip() for lists in traindf['ingredients']]
After that I create a pipeline using TfidfVectorizer and the LogisticRegression classifier:
pip = Pipeline([
    ('vect', TfidfVectorizer(
        stop_words='english',
        sublinear_tf=True,
        use_idf=bestParameters['vect__use_idf'],
        max_df=bestParameters['vect__max_df'],
        ngram_range=bestParameters['vect__ngram_range']
    )),
    ('clf', LogisticRegression(C=bestParameters['clf__C']))
])
Then I fit my training set and finally, I predict:
X, y = traindf['ingredients_string'], traindf['cuisine'].as_matrix()
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)
parameters = {}
grid_searchTS = GridSearchCV(pip, parameters, n_jobs=3, verbose=1, scoring='accuracy')
grid_searchTS.fit(X_train, y_train)
predictions = grid_searchTS.predict(X_test)
Lastly, I check how my classifier did:
print('Accuracy:', accuracy_score(y_test, predictions))
print('Confusion Matrix:', confusion_matrix(y_test, predictions))
print('Classification Report:', classification_report(y_test, predictions))
This gives me around 78% accuracy. Fine. Now I perform basically the same steps, but with one change: instead of creating a new column in the dataframe for a cleaned-up version of the ingredients, I want to write a custom analyzer that does the same thing. So I write:
def customAnalyzer(text):
    lemTxt = ["".join([WordNetLemmatizer().lemmatize(re.sub('[^A-Za-z]', ' ', ingred)) for ingred in lines.lower()]) for lines in sorted(text)]
    return " ".join(lemTxt).strip()
and of course I change the pipeline accordingly:
pip = Pipeline([
    ('vect', TfidfVectorizer(
        stop_words='english',
        sublinear_tf=True,
        use_idf=bestParameters['vect__use_idf'],
        max_df=bestParameters['vect__max_df'],
        ngram_range=bestParameters['vect__ngram_range'],
        analyzer=customAnalyzer
    )),
    ('clf', LogisticRegression(C=bestParameters['clf__C']))
])
Lastly, since I think my customAnalyzer will take care of everything, I create my train/test split as:
X, y = traindf['ingredients'], traindf['cuisine'].as_matrix()
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)
But to my surprise, my accuracy drops to 24%!
Is my intuition of using the custom analyzer in this way correct?
Do I also need to implement a custom tokenizer?
My intention is to use each ingredient as an independent entity. I do not want to deal with words: when I create my ngrams, I want them to be made out of individual ingredients instead of individual words.
How would I achieve this?
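For what it's worth, TfidfVectorizer iterates over whatever a callable analyzer returns, so returning one joined string makes every single character a feature, which would explain the accuracy collapse. A minimal sketch of an analyzer that emits one token per whole ingredient (function and variable names here are illustrative, not from the original code):

from nltk.stem import WordNetLemmatizer
import re

lemmatizer = WordNetLemmatizer()  # reuse one instance instead of building one per call

def ingredient_analyzer(ingredients):
    # `ingredients` is one document: the raw list of ingredient strings.
    # Return an iterable of tokens; each element becomes one term in the vocabulary.
    tokens = []
    for ingredient in ingredients:
        cleaned = re.sub('[^A-Za-z]', ' ', ingredient).lower()
        lemmas = [lemmatizer.lemmatize(word) for word in cleaned.split()]
        tokens.append(' '.join(lemmas))  # keep the whole ingredient as a single token
    return tokens

Note that when analyzer is a callable, scikit-learn ignores stop_words, ngram_range, and the other text-preprocessing parameters, so ingredient-level ngrams would have to be built inside the analyzer itself; the vectorizer would then be fit on traindf['ingredients'] directly.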

Related

Force RFECV to keep some features

I'm running feature selection and I've been using RFECV to find the optimal number of features.
However, there are certain features I'd like to keep regardless, so I was wondering if there's any way to force the algorithm to keep those selected ones and run the RFECV on the remaining ones.
So far, I'm running it on all of the features, by using:
def main():
    df_data = pd.read_csv(csv_file_path, index_col=0)
    X_train, y_train, X_test, y_test = split_data(df_data)
    feats_selection(X_train, y_train, X_test, y_test)

def feats_selection(X_train, y_train, X_test, y_test):
    nr_splits = 10
    nr_repeats = 1
    features_step = 1
    est = DecisionTreeRegressor()
    cv_mode = RepeatedKFold(n_splits=nr_splits, n_repeats=nr_repeats, random_state=1)
    rfecv = RFECV(estimator=est, step=features_step, cv=cv_mode, scoring='neg_mean_squared_error', verbose=0)
    ## >>> here, the RFECV algorithm is automatically selecting the optimal features <<<
    X_train_transformed = rfecv.fit_transform(X_train, y_train)
    X_test_transformed = rfecv.transform(X_test)
    ## test on test subset
    est.fit(X_train_transformed, y_train)
    y_pred = est.predict(X_test_transformed)
    rmse = mean_squared_error(y_test, y_pred, squared=False)
RFECV doesn't have such a parameter, no.
Perhaps the cleanest way to accomplish it is with a ColumnTransformer:
cols_to_always_keep = [...]  # column names if you'll fit on a dataframe, column indices otherwise
col_sel = ColumnTransformer(
    transformers=[("keep", "passthrough", cols_to_always_keep)],
    remainder=rfecv,
)
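As a sketch of how this could slot into the question's workflow (reusing rfecv and DecisionTreeRegressor from above; nothing here is from the original answer):

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

pipe = Pipeline([
    ('select', col_sel),                # kept columns pass through; RFECV prunes the remainder
    ('model', DecisionTreeRegressor()),
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)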

sklearn.exceptions.NotFittedError: This Pipeline instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator

I want to draw a decision tree, but my data is text data, so I used a Pipeline. However, I get the error in the title. Please tell me how I can plot the tree with my data using graphviz or plot_tree.
data_files = 'dataset2-Komoran.xlsx'
data = pd.read_excel(data_files)
train_data = data[['title','category','processed_title']]
categories = train_data['category']
labels = list(set(categories))
n_classes = len(labels)
print('possible categories', labels)
for l in labels:
    print('number of', l, len(train_data.loc[train_data['category'] == l]))
X_train, X_test, y_train, y_test = train_test_split(train_data['processed_title'], train_data['category'], test_size=0.2, random_state=57)
model = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('clf', DecisionTreeClassifier()),
                  ])
model.fit(X_train, y_train)
export_graphviz(model,
                out_file='tree.dot'
                )
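The likely culprit is that export_graphviz is handed the whole Pipeline rather than the fitted tree inside it. A sketch of the usual fix, assuming the step names from the pipeline above (get_feature_names_out requires scikit-learn >= 1.0):

from sklearn.tree import export_graphviz

# Pass the fitted DecisionTreeClassifier step, not the Pipeline itself.
tree_clf = model.named_steps['clf']
export_graphviz(
    tree_clf,
    out_file='tree.dot',
    feature_names=model.named_steps['vect'].get_feature_names_out(),
    class_names=[str(c) for c in tree_clf.classes_],
)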

Do I use the same Tfidf vocabulary in k-fold cross_validation

I am doing text classification based on the TF-IDF Vector Space Model. I have no more than 3000 samples. For a fair evaluation, I'm evaluating the classifier using 5-fold cross-validation. But what confuses me is whether it is necessary to rebuild the TF-IDF Vector Space Model in each fold of the cross-validation. Namely, do I need to rebuild the vocabulary and recalculate the IDF value of each term in the vocabulary in each fold?
Currently I'm doing the TF-IDF transforming based on the scikit-learn toolkit and training my classifier using SVM. My method is as follows: first, I divide the samples in hand by a ratio of 3:1; 75 percent of them are used to fit the parameters of the TF-IDF Vector Space Model, namely the size of the vocabulary, the terms contained in it, and the IDF value of each term. Then I transform the remainder with this fitted vectorizer and use those vectors for 5-fold cross-validation of the SVM (notably, I don't use the previous 75 percent of samples for anything else).
My code is as follows:
# train, test split, the train data is just for TfidfVectorizer() fit
x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, train_size=0.75, random_state=0)
tfidf = TfidfVectorizer()
tfidf.fit(x_train)
# vectorize the test data for 5-fold cross-validation
x_test = tfidf.transform(x_test)
scoring = ['accuracy']
clf = SVC(kernel='linear')
scores = cross_validate(clf, x_test, y_test, scoring=scoring, cv=5, return_train_score=False)
print(scores)
My confusion is whether my method of doing the TF-IDF transforming and the 5-fold cross-validation is correct, or whether it's necessary to rebuild the TF-IDF Vector Space Model on the train data of each fold and then transform both train and test data into TF-IDF vectors, as follows:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_index, test_index in skf.split(data_x, data_y):
    x_train, x_test = data_x[train_index], data_x[test_index]
    y_train, y_test = data_y[train_index], data_y[test_index]
    tfidf = TfidfVectorizer()
    x_train = tfidf.fit_transform(x_train)
    x_test = tfidf.transform(x_test)
    clf = SVC(kernel='linear')
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    score = accuracy_score(y_test, y_pred)
    print(score)
The StratifiedKFold approach, in which you rebuild the TfidfVectorizer() inside each fold, is the right way: by doing so you make sure that features are generated only from the training portion of each fold.
If you build the TfidfVectorizer() on the whole dataset, you leak the test dataset into the model even though you never explicitly feed it the test dataset. Parameters such as the size of the vocabulary and the IDF value of each term in the vocabulary would differ greatly when test documents are included.
The simpler way could be to use a pipeline with cross_validate. Use this!
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.model_selection import cross_validate

clf = make_pipeline(TfidfVectorizer(), SVC(kernel='linear'))
scores = cross_validate(clf, data_x, data_y, scoring=['accuracy'], cv=5, return_train_score=False)
print(scores)
Note: it is not useful to run cross_validate on the test data alone; it has to be done on the [train + validation] dataset.

using brand as a feature in product categorization

I am working on classifying products into categories using scikit-learn. I have trained a logistic regression model which has done well. I want to improve the performance of the classifier, and I have been wondering whether the brand of the product could have an impact on improving the classification. As you know, the pipeline of a classifier has steps like stemming and tokenization, so is there a way to add brand treatment as an extra feature to the pipeline so that the algorithm can use it to classify products?
The data I am using is a dataset of products (title, brand, description, etc.) with their categories as labels.
code:
with open('file_path') as json_data:
    d = json.load(json_data)

def train(classifier, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
    print("X_train:")
    print(len(X_train))
    print("X_test:")
    print(len(X_test))
    print("y_train:")
    print(len(y_train))
    print("y_test:")
    print(len(y_test))
    classifier.fit(X_train, y_train)
    print("Accuracy: %s" % classifier.score(X_test, y_test))
    return classifier

target = np.asarray(d["target"])
data = d["data"]
categs = d["categs"]
trial = Pipeline([
    ('vectorizer', TfidfVectorizer(tokenizer=stemming_tokenizer,
                                   stop_words=stopwords.words('french') + list(string.punctuation))),
    ('classifier', linear_model.LogisticRegression(C=1e2, n_jobs=-1)),
])
print("start train...")
clf = train(trial, data, target)
print("finished train")

How to use SVM in scikit learn to classify different test data for review spam detection

I am doing review spam detection using SVM in scikit-learn. For this task I am using a gold-standard dataset of 400 truthful and 400 deceptive reviews. What I have done so far is a train/test split of this same dataset, and I have computed the accuracy.
Now I want to train my SVM classifier using this dataset and then classify newly downloaded test data that is different from the original dataset.
How can I do this task? My code so far is:
def main():
    init()
    dir_path = r'C:\spam\hotel-reviews'
    files = sklearn.datasets.load_files(dir_path)
    model = CountVectorizer()
    word_counts = model.fit_transform(files.data)
    tf_transformer = sklearn.feature_extraction.text.TfidfTransformer(use_idf=True).fit(word_counts)
    X = tf_transformer.transform(word_counts)
    #print X
    print '\n\n'
    # create classifier
    clf = sklearn.svm.LinearSVC()
    # test the classifier
    test_classifier(X, files.target, clf, test_size=0.2, y_names=files.target_names, confusion=False)

def test_classifier(X, y, clf, test_size=0.3, y_names=None, confusion=False):
    # train-test split
    X_train, X_test, y_train, y_test = sklearn.cross_validation.train_test_split(X, y, test_size=test_size)
    clf.fit(X_train, y_train)
    y_predicted = clf.predict(X_test)
    print sklearn.metrics.classification_report(y_test, y_predicted, target_names=y_names)

if __name__ == '__main__':
    main()
Now I want to classify my own different review data of 500 reviews in a reviews.txt file using the above trained classifier, so how can I do this?
To score your data, two steps are needed.
Either return clf and use a separate method for scoring, or score within the same method. This is the workflow:
def scoreData(clf):
    x_for_predict = loadScoringData("reviews.txt")  # signature only; assumes the same data format without the target variable
    y_predict = clf.predict(x_for_predict)
    plotResults(clf, y_predict)  # just a signature
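To make that concrete, here is a sketch under the assumption that model, tf_transformer, and a fitted clf from main() are still in scope. The new reviews must go through transform (never fit_transform) on the vectorizers fitted on the training data, so the feature space matches:

# Hypothetical sketch: score reviews.txt with the objects fitted during training.
with open('reviews.txt') as f:
    new_reviews = [line.strip() for line in f if line.strip()]

new_counts = model.transform(new_reviews)           # same fitted CountVectorizer
new_tfidf = tf_transformer.transform(new_counts)    # same fitted TfidfTransformer
predictions = clf.predict(new_tfidf)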
