I'm trying to build a spam mail classifier. I've collected multiple datasets from around the internet (e.g., the SpamAssassin Database for spam/ham mails) and built this:
import os
import numpy
import pandas
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold  # sklearn.cross_validation has been removed; use model_selection
from sklearn.metrics import confusion_matrix, f1_score
from sklearn import svm
NEWLINE = '\n'
HAM = 'ham'
SPAM = 'spam'
SOURCES = [
    ('C:/data/spam', SPAM),
    ('C:/data/easy_ham', HAM),
    # The following sources are commented out because they take too long to process:
    # ('C:/data/hard_ham', HAM),
    # ('C:/data/beck-s', HAM),
    # ('C:/data/farmer-d', HAM),
    # ('C:/data/kaminski-v', HAM),
    # ('C:/data/kitchen-l', HAM),
    # ('C:/data/lokay-m', HAM),
    # ('C:/data/williams-w3', HAM),
    # ('C:/data/BG', SPAM),
    # ('C:/data/GP', SPAM),
    # ('C:/data/SH', SPAM)
]
SKIP_FILES = {'cmds'}
def read_files(path):
    # os.walk already descends into every subdirectory of path
    for root, dir_names, file_names in os.walk(path):
        for file_name in file_names:
            if file_name not in SKIP_FILES:
                file_path = os.path.join(root, file_name)
                if os.path.isfile(file_path):
                    # skip the mail headers; the body starts after the first blank line
                    past_header, lines = False, []
                    with open(file_path, encoding="latin-1") as f:
                        for line in f:
                            if past_header:
                                lines.append(line)
                            elif line == NEWLINE:
                                past_header = True
                    content = NEWLINE.join(lines)
                    yield file_path, content
def build_data_frame(path, classification):
    rows = []
    index = []
    for file_name, text in read_files(path):
        rows.append({'text': text, 'class': classification})
        index.append(file_name)
    data_frame = DataFrame(rows, index=index)
    return data_frame
# DataFrame.append was removed from recent pandas; build the frames and concatenate once
frames = [build_data_frame(path, classification) for path, classification in SOURCES]
data = pandas.concat(frames)
data = data.reindex(numpy.random.permutation(data.index))
pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('classifier', svm.SVC(gamma=0.001, C=100))
])
k_fold = KFold(n_splits=6)
scores = []
confusion = numpy.array([[0, 0], [0, 0]])
for train_indices, test_indices in k_fold.split(data):
    train_text = data.iloc[train_indices]['text'].values
    train_y = data.iloc[train_indices]['class'].values.astype(str)
    test_text = data.iloc[test_indices]['text'].values
    test_y = data.iloc[test_indices]['class'].values.astype(str)
    pipeline.fit(train_text, train_y)
    predictions = pipeline.predict(test_text)
    confusion += confusion_matrix(test_y, predictions)
    score = f1_score(test_y, predictions, pos_label=SPAM)
    scores.append(score)
print('Total emails classified:', len(data))
print('Support Vector Machine Output:')
print('F1 score: ' + str((sum(scores) / len(scores)) * 100) + '%')
print('Confusion matrix:')
print(confusion)
The commented-out lines are additional mail collections. Even when I comment out most of the datasets and keep only the one with the fewest mails, it still runs extremely slowly (~15 minutes) and gives a score of about 91%. How do I improve the speed and accuracy?
You are using a kernel SVM. There are two problems with this.
Running Time Complexity of Kernel SVM: The first step in performing kernel SVM is building a similarity matrix, which becomes the feature set. With 30,000 documents, the number of elements in the similarity matrix becomes 900,000,000. This grows quickly as your corpus grows, since the matrix grows as the square of the number of documents in your corpus. This problem could be mitigated using RBFSampler in scikit-learn, but you probably don't want to use that, for the next reason.
Dimensionality: You are using term and bigram counts as your feature set, which is an extremely high-dimensional dataset. With an RBF kernel in high-dimensional spaces, even small differences (noise) can have a large impact on similarity results. See the curse of dimensionality. This is likely why your RBF kernel yields worse results than a linear kernel.
Stochastic Gradient Descent: SGD can be used instead of the standard SVM, and with good parameter tuning it may yield similar or possibly even better results. The drawback is that SGD has more parameters to tune regarding the learning rate and learning-rate schedule. Also, SGD is not ideal when only a few passes over the data are possible; in that case, other algorithms like Follow The Regularized Leader (FTRL) do better, though scikit-learn does not implement FTRL. Using SGDClassifier with loss="modified_huber" often works well.
Now that we have the problems out of the way, there are several ways you can improve performance (a sketch combining these appears after this list):
tf-idf weights: With tf-idf, more common words are weighted less. This allows the classifier to better represent rare words that are more meaningful. This can be implemented by switching CountVectorizer to TfidfVectorizer.
Parameter tuning: With a linear SVM there is no gamma parameter, but the C parameter can be used to greatly improve results. In the case of SGDClassifier, the alpha and learning-rate parameters can be tuned as well.
Ensembling: Running your model on multiple subsamples and averaging the results will often produce a more robust model than a single run. This can be done in scikit-learn using BaggingClassifier. Combining different approaches can also produce significantly better results. If substantially different approaches are used, consider using a stacked model with a tree model (RandomForestClassifier or GradientBoostingClassifier) as the last stage.
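As a rough sketch of how those suggestions fit together (the parameter values here are placeholders to tune, not recommendations):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# tf-idf features feeding a linear model trained with SGD;
# alpha and max_iter are placeholder values to tune via grid search
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('classifier', SGDClassifier(loss='modified_huber', alpha=1e-5,
                                 max_iter=20, random_state=42)),
])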
I want to start developing an application using machine learning. I want to classify text as spam or not spam. I have two files, spam.txt and ham.txt, each containing thousands of sentences. I want to use a classifier, let's say LogisticRegression.
For example, as I saw on the internet, to fit my model I need to do something like this:
lr = LogisticRegression()
model = lr.fit(X_train, y_train)
So here comes my question: what actually are X_train and y_train? How can I obtain them from my sentences? I searched on the internet but did not understand; this is my last resort, as I am pretty new to this topic. Thank you!
According to the documentation (see here):
X corresponds to your float feature matrix of shape (n_samples, n_features) (i.e., the design matrix of your training set)
y is the float target vector of shape (n_samples,) (the label vector). In your case, label 0 could correspond to a spam example and 1 to a ham one
The question is now about how to get a float feature matrix from text data.
A common scheme is to use a tf-idf vectorisation (more on this here), which is available in sklearn.
The vectorisation can be chained with the logistic regression via the Pipeline API of sklearn.
This is roughly how the code would look:
from itertools import chain
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import numpy as np
# prepare string data
with open('spam.txt', 'r') as f:
    spam = f.readlines()
with open('ham.txt', 'r') as f:
    ham = f.readlines()
text_train = list(chain(spam, ham))
# prepare labels: 0 for spam, 1 for ham
labels_train = np.concatenate((np.zeros(len(spam)), np.ones(len(ham))))
# build pipeline: tf-idf vectorisation followed by logistic regression
vectorizer = TfidfVectorizer()
regressor = LogisticRegression()
pipeline = Pipeline([('vectorizer', vectorizer), ('regressor', regressor)])
# fit pipeline
pipeline.fit(text_train, labels_train)
# test predict
test = ["Is this spam or ham?"]
pipeline.predict(test)  # predicted label, 0.0 (spam) or 1.0 (ham)
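If you want a spam probability rather than a hard label, LogisticRegression also supports probability estimates:
pipeline.predict_proba(test)  # probabilities per class, ordered as in pipeline.classes_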
I'm training a binary classifier using Python and the popular scikit-learn module's SVM class. After training, I use the predict method to make a classification, as laid out in scikit-learn's SVC documentation.
I would like to know more about the significance of my sample features to the resulting classification made by the trained decision_function (support vectors). Any strategies for evaluating feature significance when making predictions with such a model are welcome.
Thanks!
Andre
So, how do we interpret feature significance for a given sample's classification?
I think using a linear kernel is the most straightforward way to first approach this, because of the significance and relative simplicity of the svc.coef_ attribute of a trained model. Check out Bitwise's answer.
Below I will train a linear-kernel SVM on scikit-learn's breast cancer dataset. Then we will look at the coef_ attribute. I will include a simple plot showing how the dot product of the classifier's coefficients and the training feature data separates the resulting classes.
from sklearn import svm
from sklearn.datasets import load_breast_cancer
import numpy as np
import matplotlib.pyplot as plt
data = load_breast_cancer()
X = data.data    # training features
y = data.target  # training labels (0 = malignant, 1 = benign)
lin_clf = svm.SVC(kernel='linear')
lin_clf.fit(X, y)
# decision scores: dot product with the coefficients plus the intercept
# (equivalent to lin_clf.decision_function(X))
scores = (np.dot(X, lin_clf.coef_.T) + lin_clf.intercept_).ravel()
b0 = y == 0  # boolean or "mask" index arrays
b1 = y == 1
malignant_scores = scores[b0]
benign_scores = scores[b1]
fig = plt.figure()
fig.suptitle("score breakdown by classification", fontsize=14, fontweight='bold')
score_box_plt = plt.boxplot(
    [malignant_scores, benign_scores],
    notch=True,
    labels=list(data.target_names),
    vert=False
)
plt.show()
As you can see we do seem to have accessed the appropriate intercept and coefficient values. There is obvious separation of class scores with our decision boundary hovering around 0.
Now that we have a scoring system based on our linear coefficients, we can easily investigate how each feature contributed to the final classification. Here we display each feature's effect on the final score of that sample.
## sample we're using: X[2] --> classified benign, lin_clf score ~ -20
lin_clf.predict(X[2].reshape(1, 30))
# per-feature contribution of this sample to its decision score
contributions = np.multiply(X[2], lin_clf.coef_.reshape((30,)))
feature_number = np.arange(len(contributions)) + 1
plt.bar(feature_number, contributions, align='center')
plt.xlabel('feature index')
plt.ylabel('score contribution')
plt.title('contribution to classification outcome by feature index')
plt.show()
We can also sort this same data to get a contribution-ranked list of features for a given classification, showing which features contributed most to the score whose composition we are assessing.
abs_contributions = np.flip(np.sort(np.absolute(contributions)), axis=0)
feat_and_contrib = []
for contrib in abs_contributions:
    if contrib not in contributions:
        # the absolute value came from a negative contribution
        contrib = -contrib
    feat = np.where(contributions == contrib)
    feat_and_contrib.append((feat[0][0], contrib))
# sorted by max abs value; each row is a tuple: (feature index, contrib)
feat_and_contrib
From that ranked list we can see that the top five feature indices contributing to the final score (of around -20, with classification 'benign') were [0, 22, 13, 2, 21], which correspond to these feature names in our dataset: ['mean radius', 'worst perimeter', 'area error', 'mean perimeter', 'worst texture'].
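A shorter route to the same ranking, as a sketch, is to sort the indices by absolute contribution with np.argsort instead of the lookup loop above:
# feature indices ordered by absolute contribution, largest first
order = np.argsort(np.abs(contributions))[::-1]
feat_and_contrib = [(i, contributions[i]) for i in order]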
Suppose you have a bag-of-words featurization and you want to know which words are important for classification; then use this code for a linear SVM:
# lr_svm is the fitted linear SVM and text_vectorizer the fitted vectorizer
weights = np.abs(lr_svm.coef_[0])
sorted_index = np.argsort(weights)[::-1]
top_10 = sorted_index[:10]
terms = text_vectorizer.get_feature_names_out()  # get_feature_names() in older scikit-learn
for ind in top_10:
    print(terms[ind])
You can use SelectFromModel in sklearn to get the names of the most relevant features of your model. Here is an example of extracting the features for LassoCV.
You can also check out this example, which makes use of the coef_ attribute in SVM to visualize the top features.
I am trying to use sklearn.mixture.GaussianMixture for classifying pixels in a hyperspectral image. There are 15 classes (1-15). I tried the method shown at http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html, where the means are initialized with means_init. I tried this too, but my accuracy is poor (about 10%). I also tried changing the covariance type, threshold, maximum iterations, and number of initializations, but the results are the same.
Am I doing this correctly? Please provide input.
import numpy as np
from sklearn.mixture import GaussianMixture
import scipy.io as sio
from sklearn.model_selection import train_test_split
uh_data = sio.loadmat('/Net/hico/data/users/nikhil/contest_uh_casi.mat')
data = uh_data['contest_uh_casi']
uh_labels = sio.loadmat('/Net/hico/data/users/nikhil/contest_gt_tr.mat')
labels = uh_labels['contest_gt_tr']
reshaped_data = np.reshape(data, (data.shape[0] * data.shape[1], data.shape[2]))
print('reshaped data:', reshaped_data.shape)
reshaped_label = np.reshape(labels, (labels.shape[0] * labels.shape[1], -1))
print('reshaped label:', reshaped_label.shape)
con_data = np.hstack((reshaped_data, reshaped_label))
pre_data = con_data[con_data[:, 144] > 0]  # keep only labelled pixels
total_data = pre_data[:, 0:144]
total_label = pre_data[:, 144]
train_data, test_data, train_label, test_label = train_test_split(
    total_data, total_label, test_size=0.30, random_state=42)
classifier = GaussianMixture(n_components=15, covariance_type='diag',
                             max_iter=100, random_state=42, tol=0.1, n_init=1)
classifier.means_init = np.array([train_data[train_label == i].mean(axis=0)
                                  for i in range(1, 16)])
classifier.fit(train_data)
pred_lab_train = classifier.predict(train_data)
train_accuracy = np.mean(pred_lab_train.ravel() == train_label.ravel()) * 100
print('train accuracy:', train_accuracy)
pred_lab_test = classifier.predict(test_data)
test_accuracy = np.mean(pred_lab_test.ravel() == test_label.ravel()) * 100
print('test accuracy:', test_accuracy)
My data has 66,485 pixels with 144 features each. I also tried applying feature-reduction techniques such as PCA, LDA, and KPCA, but the results are still the same.
Gaussian mixture is not a classifier. It is a density-estimation method, and expecting its components to magically align with your classes is not a good idea. You should try actual supervised techniques, since you clearly have access to labels. Scikit-learn offers lots of these, including Random Forest, KNN, SVM, ... pick your favourite. GMM simply tries to fit a mixture of Gaussians to your data, but nothing forces it to place them according to the labeling (which is not even provided in the fit call). From time to time this will work, but only for trivial problems where the classes are so well separated that even Naive Bayes would work; in general, it is simply an invalid tool for the problem.
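For instance, a supervised baseline is only a few lines (a minimal sketch reusing train_data/train_label/test_data/test_label from the question's code):
from sklearn.ensemble import RandomForestClassifier
# an off-the-shelf supervised model that actually uses the labels
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(train_data, train_label)
print('test accuracy:', rf.score(test_data, test_label) * 100)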
GMM is not a classifier but a generative model. You can still apply it to a classification problem via Bayes' theorem. It is not true that classification based on GMMs works only for trivial problems; however, it is based on a mixture of Gaussian components, so it best fits problems with high-level features.
Your code uses the GMM incorrectly as a classifier. You should use GMMs as posterior distributions, one GMM per class.
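A minimal sketch of that approach, assuming train_data/train_label/test_data/test_label as in the question: fit one GMM per class as a class-conditional density, then pick the class with the highest log-likelihood plus log-prior (n_components per class is a placeholder to tune).
import numpy as np
from sklearn.mixture import GaussianMixture

classes = np.unique(train_label)
models, log_priors = [], []
for c in classes:
    Xc = train_data[train_label == c]
    # class-conditional density p(x | y = c)
    gmm = GaussianMixture(n_components=3, covariance_type='diag', random_state=42)
    gmm.fit(Xc)
    models.append(gmm)
    log_priors.append(np.log(len(Xc) / len(train_data)))
# posterior is proportional to log p(x | y = c) + log p(y = c); take the argmax
log_post = np.stack([m.score_samples(test_data) + lp
                     for m, lp in zip(models, log_priors)], axis=1)
pred = classes[np.argmax(log_post, axis=1)]
print('test accuracy:', np.mean(pred == test_label.ravel()) * 100)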
I am trying my hand at machine learning and have been using the Python-based scikit-learn library for it.
I wish to solve a classification problem in which a chunk of text (say 1k-2k words) is classified into one or more categories. For this I have been studying scikit-learn for a while now.
As my data is in the range of 2-3 million documents, I was using SGDClassifier with HashingVectorizer, learning via partial_fit, coded as below:
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import HashingVectorizer
import numpy as np
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn
import copy
data = pd.read_csv('train_shuffled.csv', on_bad_lines='skip')  # error_bad_lines=False in older pandas
data_all = copy.deepcopy(data)
target = data['category']
del data['category']
cls = np.unique(target)
model = SGDClassifier(loss='log_loss', verbose=1)  # this loss was named 'log' in older scikit-learn
vect = HashingVectorizer(stop_words='english', strip_accents='unicode', analyzer='word')
loop = len(target) // 100
for passes in range(0, 5):
    count, r = 0, 0
    print("Pass " + str(passes + 1))
    for q in range(0, loop):
        # HashingVectorizer is stateless, so transform() suffices, and it
        # does its own tokenization, so no separate word tokenizer is needed
        d = vect.transform(data['content'][r:r + 100])
        t = np.array(target[r:r + 100])
        model.partial_fit(d, t, cls)
        r = r + 100
    # reshuffle the training data between passes
    data = copy.deepcopy(data_all)
    data = data.iloc[np.random.permutation(len(data))]
    data = data.reset_index(drop=True)
    target = data['category']
    del data['category']
print(model)
joblib.dump(model, 'Model.pkl')
joblib.dump(vect, 'Vectorizer.pkl')
While going through the learning process, I read in an answer here on Stack Overflow that manually shuffling the training data on each iteration results in a better model.
Using the classifier and vectorizer with default parameters, I got an accuracy score of ~58.4%. Since then, I have tried playing with different parameter settings for both the vectorizer and the classifier, but with no increase in accuracy.
Can anyone tell me whether I have been doing something wrong, or what should be done to improve the model score?
Any help will be highly appreciated.
Thanks!
1) Consider using GridSearchCV to tune parameters (see the sketch after this list). http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html
2) Consider feature engineering, to combine existing features into new features. E.g., use the polynomial-features, feature-selection, and feature-union tools provided in sklearn.
3) Try different models. Not all models work on all problems. Try using an ensemble of simpler models and some kind of decision function to take the outputs of those models and make a prediction. Some are in the ensemble module, but you can use the voting classifiers to make your own.
But by far the best and most important thing to do: look at the data. Find examples where the classifier performed badly. Why did it perform badly? Can you classify them by reading them (i.e., is it reasonable to expect an algorithm to classify that text)? If they can be classified, what does the model miss?
All these will help guide what to do next.
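For point 1, here is a minimal GridSearchCV sketch; the grid below is an assumption to adapt, and X_sample/y_sample stand for a hashed subsample of your training data (searching all 2-3 million rows would be slow):
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier

# hypothetical grid over regularization strength and penalty type
param_grid = {'alpha': [1e-6, 1e-5, 1e-4], 'penalty': ['l2', 'elasticnet']}
search = GridSearchCV(SGDClassifier(loss='log_loss'), param_grid, cv=3)
search.fit(X_sample, y_sample)
print(search.best_params_, search.best_score_)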
I will be putting the max bounty on this, as I am struggling to learn these concepts! I am trying to use some ranking data in a logistic regression. I want to use machine learning to make a simple classifier for whether a webpage is "good" or not. It's just a learning exercise, so I don't expect great results; I'm just hoping to learn the "process" and coding techniques.
I have put my data in a .csv as follows:
URL WebsiteText AlexaRank GooglePageRank
In my test CSV we have:
URL WebsiteText AlexaRank GooglePageRank Label
Label is a binary classification indicating "good" with 1 or "bad" with 0.
I currently have my LR running using only the website text, on which I run TF-IDF.
I have two questions I need help with. I'll be putting a max bounty on this question and awarding it to the best answer, as this is something I'd like some good help with so I, and others, may learn.
How can I normalize my ranking data for AlexaRank? I have a set of 10,000 webpages, for which I have the Alexa rank of all of them; however, they aren't ranked 1-10,000. They are ranked out of the entire internet, so while http://www.google.com may be ranked #1, http://www.notasite.com may be ranked #83904803289480. How do I normalize this in scikit-learn in order to get the best possible results from my data?
I am running my logistic regression in this way, and I am nearly sure I have done it incorrectly. I am trying to do the TF-IDF on the website text, then add the two other relevant columns and fit the logistic regression. I'd appreciate it if someone could quickly verify that I am taking in the three columns I want to use in my LR correctly. Any and all feedback on how I can improve would also be appreciated.
import numpy as np
import pandas as p
from sklearn import linear_model as lm
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
loadData = lambda f: np.genfromtxt(open(f, 'r'), delimiter=' ')
print("loading data..")
traindata = list(np.array(p.read_table('train.tsv'))[:, 2])  # reading WebsiteText column for TF-IDF
testdata = list(np.array(p.read_table('test.tsv'))[:, 2])
y = np.array(p.read_table('train.tsv'))[:, -1]  # reading label
tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode', analyzer='word',
                      token_pattern=r'\w{1,}', ngram_range=(1, 2), use_idf=1, smooth_idf=1, sublinear_tf=1)
rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001, C=1, fit_intercept=True,
                           intercept_scaling=1.0, class_weight=None, random_state=None,
                           solver='liblinear')  # liblinear supports dual=True
X_all = traindata + testdata
lentrain = len(traindata)
print("fitting pipeline")
tfv.fit(X_all)
print("transforming data")
X_all = tfv.transform(X_all)
X = X_all[:lentrain]
X_test = X_all[lentrain:]
print("20 Fold CV Score: ", np.mean(cross_val_score(rd, X, y, cv=20, scoring='roc_auc')))
# Add two integer columns
AlexaAndGoogleTrainData = list(np.array(p.read_table('train.tsv'))[2:, 3])  # Not sure if I am doing this correctly. Expecting it to contain the AlexaRank and GooglePageRank columns.
AlexaAndGoogleTestData = list(np.array(p.read_table('test.tsv'))[2:, 3])
AllAlexaAndGoogleInfo = AlexaAndGoogleTestData + AlexaAndGoogleTrainData
# Add two columns to X.
X = np.append(X, AllAlexaAndGoogleInfo, 1)  # Think I have done this incorrectly.
print("training on full data")
rd.fit(X, y)
pred = rd.predict_proba(X_test)[:, 1]
testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
pred_df.to_csv('benchmark.csv')
print("submission file created..")
Thank you very much for all feedback - please post if you need any further information!
I guess sklearn.preprocessing.StandardScaler would be the first thing you want to try. StandardScaler transforms all of your features into Mean-0-Std-1 features.
This definitely gets rid of your first problem: AlexaRank will be guaranteed to be spread around 0 and bounded. (Yes, even massive AlexaRank values like 83904803289480 are transformed to small floating-point numbers.) Of course, the results will not be integers between 1 and 10,000, but they will maintain the same order as the original ranks. And in this case, keeping the rank bounded and normalized will help solve your second problem, as follows.
In order to understand why normalization would help in LR, let's revisit the logit formulation of LR: log(p / (1 - p)) = b0 + b1*X1 + b2*X2 + b3*X3 + b4*X4 + b5*X5.
In your case, X1, X2, X3 are three TF-IDF features and X4, X5 are the Alexa/Google rank-related features. The linear form of the equation suggests that each coefficient represents the change in the logit of y per unit change in a variable. Think about what happens when your X4 is kept fixed at a massive rank value, say 83904803289480. In that case, the Alexa rank variable dominates your LR fit, and a small change in a TF-IDF value has almost no effect on the fit. One might think the coefficients should be able to adjust to small/large values to account for the differences between these features, but not in this case: it's not only the magnitude of the variables that matters but also their range. Alexa rank definitely has a huge range and will dominate your LR fit in this case. Therefore, I guess normalizing all variables using StandardScaler to adjust their ranges will improve the fit.
Here is how you can scale the X matrix.
from sklearn import preprocessing
# if X is sparse (as tf-idf output is), pass with_mean=False to StandardScaler
sc = preprocessing.StandardScaler().fit(X)
X = sc.transform(X)
Don't forget to use same scaler to transform X_test.
X_test = sc.transform(X_test)
Now you can use the fitting procedure etc.
rd.fit(X, y)
rd.predict_proba(X_test)
Check this out for more on sklearn preprocessing: http://scikit-learn.org/stable/modules/preprocessing.html
Edit: The parsing and column-merging part can be done easily using pandas; there is no need to convert the matrices into lists and then append them. Moreover, pandas DataFrames can be directly indexed by their column names.
AlexaAndGoogleTrainData = p.read_table('train.tsv', header=0)[["AlexaRank", "GooglePageRank"]]
AlexaAndGoogleTestData = p.read_table('test.tsv', header=0)[["AlexaRank", "GooglePageRank"]]
# DataFrame.append was removed from recent pandas; concat does the same job
AllAlexaAndGoogleInfo = p.concat([AlexaAndGoogleTestData, AlexaAndGoogleTrainData])
Note that we are passing the header=0 argument to read_table to maintain the original header names from the tsv file. Also note how we can index using an entire set of columns. Finally, you can stack this new matrix with X using numpy.hstack.
X = np.hstack((X, AllAlexaAndGoogleInfo))
hstack horizontally combines two multi-dimensional array-like structures, provided their lengths are the same. Note that if X is a scipy sparse matrix (which TfidfVectorizer produces), numpy.hstack will not combine it correctly; use scipy.sparse.hstack instead.
Regarding normalizing the numeric ranks, either scikit-learn's StandardScaler or a logarithmic transform (or both) should work well enough.
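As a small sketch of that combined transform (np.log1p compresses the huge rank range first, then StandardScaler centres and scales; the toy values below are made up):
import numpy as np
from sklearn.preprocessing import StandardScaler

# two made-up (AlexaRank, GooglePageRank) rows
ranks = np.array([[1.0, 3.0], [83904803289480.0, 5.0]])
scaled = StandardScaler().fit_transform(np.log1p(ranks))
print(scaled)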
For building up a working pipeline, I find my sanity greatly benefits from using the pandas package and the sklearn.pipeline utilities. Here is a simple script that should do what you need.
First, a couple of utility classes I always seem to need. It would be nice to have something like these in sklearn.pipeline or sklearn.utilities.
from sklearn import base

class Columns(base.TransformerMixin, base.BaseEstimator):
    """Select a subset of DataFrame columns."""
    def __init__(self, columns):
        super(Columns, self).__init__()
        self.columns_ = columns

    def fit(self, *args, **kwargs):
        return self

    def transform(self, X, *args, **kwargs):
        return X[self.columns_]

class Text(base.TransformerMixin, base.BaseEstimator):
    """Join the selected columns into a single text string per row."""
    def fit(self, *args, **kwargs):
        return self

    def transform(self, X, *args, **kwargs):
        return X.apply("\t".join, axis=1, raw=False)
Now set up the pipeline.
I used the SGDClassifier implementation of logistic regression, since it tends to be more efficient for high-dimensional data like text; I also usually find that hinge loss gives better results than logistic regression anyway.
from sklearn import linear_model as lin
from sklearn import metrics
from sklearn.feature_extraction import text as txt
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing as prep
import numpy as np
import pandas as pd
pipe = Pipeline([
    ('feat', FeatureUnion([
        ('txt', Pipeline([
            ('txtcols', Columns(["WebsiteText"])),
            ('totxt', Text()),
            ('vect', txt.TfidfVectorizer()),
        ])),
        ('num', Pipeline([
            ('numcols', Columns(["AlexaRank", "GooglePageRank"])),
            ('scale', prep.StandardScaler()),
        ])),
    ])),
    ('clf', lin.SGDClassifier(loss="log_loss")),  # this loss was named "log" in older scikit-learn
])
Next train the model:
train = pd.read_csv("train.csv")
pipe.fit(train, train.Label)
Finally evaluate on test data:
test = pd.read_csv("test.csv")
tstlbl = np.array(test.Label)
print(pipe.score(test, tstlbl))
pred = pipe.predict(test)
print(metrics.confusion_matrix(tstlbl, pred))
print(metrics.classification_report(tstlbl, pred))
print(metrics.f1_score(tstlbl, pred))
prob = pipe.decision_function(test)
print(metrics.roc_auc_score(tstlbl, prob))
print(metrics.average_precision_score(tstlbl, prob))
You will probably not get very good results with everything using default settings like this, but it should give you a working baseline. I can suggest some parameter settings that usually work for me if you like.