I have 9164 points, where 4303 are labeled as the class I want to predict and 4861 are labeled as not that class. There are no duplicate points.
Following How to split into train, test and evaluation sets in sklearn?, and since each item of my dataset is a tuple of 3 elements (id, vector, label), I do:
df = pd.DataFrame(dataset)
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])
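# df.sample(frac=1) shuffles the rows; splitting at 60% and 80% then gives a
# 60/20/20 train/validate/test split.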
train_labels = construct_labels(train)
train_data = construct_data(train)
test_labels = construct_labels(test)
test_data = construct_data(test)
def predict_labels(test_data, classifier):
    labels = []
    for test_d in test_data:
        labels.append(classifier.predict([test_d]))
    return np.array(labels)
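# Note: SVC.predict accepts a 2-D array directly, so the loop above could be
# replaced by a single call returning a flat array of labels:
#     predicted_labels = classifier.predict(test_data)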
def construct_labels(df):
    labels = []
    for index, row in df.iterrows():
        if row[2] == 'Trump':
            labels.append('Trump')
        else:
            labels.append('Not Trump')
    return np.array(labels)
def construct_data(df):
    first_row = df.iloc[0]
    data = np.array([first_row[1]])
    for index, row in df.iterrows():
        if first_row[0] != row[0]:
            data = np.concatenate((data, np.array([row[1]])), axis=0)
    return data
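# Assuming the ids are unique, this builds the same array as the simpler sketch:
#     data = np.array([row[1] for _, row in df.iterrows()])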
and then:
>>> classifier = SVC(verbose=True)
>>> classifier.fit(train_data, train_labels)
[LibSVM].......*..*
optimization finished, #iter = 9565
obj = -2718.376533, rho = 0.132062
nSV = 5497, nBSV = 2550
Total nSV = 5497
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=True)
>>> predicted_labels = predict_labels(test_data, classifier)
>>> correct = 0
>>> for p, t in zip(predicted_labels, test_labels):
...     if p == t:
...         correct = correct + 1
and I get only 943 correct labels out of 1833 (= len(test_labels)) -> 943*100/1833 = 51.4%
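(For reference, the same accuracy can be computed with scikit-learn's accuracy_score; the ravel is needed because predict_labels above returns one single-element array per sample.)
from sklearn.metrics import accuracy_score
print(accuracy_score(test_labels, np.ravel(predicted_labels)))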
I suspect I am missing something big here; maybe I should set a parameter on the classifier to do more refined work, or something?
Note: first time using SVMs here, so anything you might take for granted, I might not have even imagined...
Attempt:
I went ahead and decreased the number of negative examples to 4303 (the same number as the positive examples). This slightly improved accuracy.
Edit after the answer:
>>> print(clf.best_estimator_)
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> classifier = SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
... decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
... max_iter=-1, probability=False, random_state=None, shrinking=True,
... tol=0.001, verbose=False)
>>> classifier.fit(train_data, train_labels)
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Also I tried clf.fit(train_data, train_labels), which performed the same.
Edit with data (the data are not random):
>>> train_data[0]
array([ 20.21062112, 27.924016 , 137.13815308, 130.97432804,
... # there are 256 coordinates in total
67.76352596, 56.67798138, 104.89566517, 10.02616417])
>>> train_labels[0]
'Not Trump'
>>> train_labels[1]
'Trump'
Most estimators in scikit-learn, such as SVC, are initialized with a number of input parameters, also known as hyperparameters. Depending on your data, you will have to figure out what to pass to the estimator during initialization. If you look at the SVC documentation in scikit-learn, you will see that it can be initialized with several different input parameters.
For simplicity, let's consider kernel, which can be 'rbf' or 'linear' (among a few other choices), and C, which is a penalty parameter; say you want to try the values 0.01, 0.1, 1, 10, 100 for C. That leads to 10 different possible models to create and evaluate.
One simple solution is to write two nested for-loops, one over kernel and the other over C, create the 10 possible models, and see which one performs best, as sketched below. However, if you have several hyperparameters to tune, writing nested for-loops over all of them quickly becomes tedious.
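For illustration, a minimal sketch of that manual search, scoring each model on a held-out validation set (the names below reuse the question's construct_data/construct_labels helpers and its validate split, so they are assumptions about your setup):
from sklearn.svm import SVC

validate_data = construct_data(validate)
validate_labels = construct_labels(validate)

best_score, best_params = 0.0, None
for kernel in ['rbf', 'linear']:
    for C in [0.01, 0.1, 1, 10, 100]:
        model = SVC(kernel=kernel, C=C).fit(train_data, train_labels)
        score = model.score(validate_data, validate_labels)  # mean accuracy
        if score > best_score:
            best_score, best_params = score, {'kernel': kernel, 'C': C}
print(best_params, best_score)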
Luckily, scikit-learn has a better way of creating different models from combinations of hyperparameter values and choosing the best one: GridSearchCV. GridSearchCV is initialized with two things: an instance of an estimator, and a dictionary of hyperparameters with the desired values to examine. It then fits all possible models given the choices of hyperparameters and finds the best one, so you do not need to write any nested for-loops. Here is an example:
# In scikit-learn >= 0.18 the import is: from sklearn.model_selection import GridSearchCV
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC
print("Fitting the classifier to the training set")
param_grid = {'C': [0.01, 0.1, 1, 10, 100], 'kernel': ['rbf', 'linear']}
clf = GridSearchCV(SVC(class_weight='balanced'), param_grid)
clf = clf.fit(train_data, train_labels)
print("Best estimator found by grid search:")
print(clf.best_estimator_)
You will need to use something similar to this example and play with different hyperparameters. If you explore a good range of values for your hyperparameters, there is a very good chance you will find a much better model this way.
It is, however, possible for GridSearchCV to take a very long time to create all these models in order to find the best one. A more practical approach is to use RandomizedSearchCV instead, which creates a random subset of all possible models (sampling the hyperparameter values). It should run much faster if you have many hyperparameters, and its best model is usually quite good.
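A minimal sketch of that approach (assuming scikit-learn >= 0.18, where the search classes live in sklearn.model_selection, and scipy >= 1.4 for loguniform):
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
from scipy.stats import loguniform

# Try 20 random (C, kernel) combinations instead of the full grid
param_dist = {'C': loguniform(1e-2, 1e2), 'kernel': ['rbf', 'linear']}
search = RandomizedSearchCV(SVC(class_weight='balanced'), param_dist,
                            n_iter=20, random_state=42)
search.fit(train_data, train_labels)
print(search.best_estimator_)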
After the comments of sascha and the answer of shahins, I did this eventually:
df = pd.DataFrame(dataset)
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])
train_labels = construct_labels(train)
train_data = construct_data(train)
test_labels = construct_labels(test)
test_data = construct_data(test)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train_data = scaler.fit_transform(train_data)
from sklearn.svm import SVC
# Classifier found with shahins' answer
classifier = SVC(C=10, cache_size=200, class_weight='balanced', coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
classifier = classifier.fit(train_data, train_labels)
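# Note: strictly, the scaler fitted on the training data should be reused on the
# test data (i.e. scaler.transform rather than fit_transform below), so that the
# test set is scaled with the training statistics.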
test_data = scaler.fit_transform(test_data)
predicted_labels = predict_labels(test_data, classifier)
and got:
>>> correct_labels = count_correct_labels(predicted_labels, test_labels)
>>> print_stats(correct_labels, len(test_labels))
Correct labels = 1624
Accuracy = 88.5979268958
with these methods:
def count_correct_labels(predicted_labels, test_labels):
    correct = 0
    for p, t in zip(predicted_labels, test_labels):
        if p[0] == t:
            correct = correct + 1
    return correct
def print_stats(correct_labels, len_test_labels):
    print("Correct labels = " + str(correct_labels))
    print("Accuracy = " + str(correct_labels * 100 / float(len_test_labels)))
I was able to optimize further with more hyperparameter tuning!
Helpful link: RBF SVM parameters
Note: If I don't transform the test_data, accuracy is 52.7%.
Related
I have a sklearn pipeline that consists of a custom transformer followed by an XGBClassifier. What I would like to add as a final step in the pipeline is another custom transformer that transforms the results of the XGBClassifier.
This last custom transformer ranks the predicted probabilities into bins (5-percentiles).
Pipeline([
    ('custom_trsf1', custom_trsf1),
    ('clf', XGBClassifier()),
    ('custom_trsf2', custom_trsf2)])
The problem is that a sklearn pipeline requires that all steps but the last have both a fit and a transform method. Can I solve this in another way, instead of extending XGBClassifier and adding a transform method to it?
Looking at the source code of the Pipeline implementation, the estimator used to fit the data sits in the last position of your steps; the _final_estimator property of Pipeline returns the last element of the pipeline's steps.
@property
def _final_estimator(self):
    estimator = self.steps[-1][1]
    return 'passthrough' if estimator is None else estimator
where steps might be something like
steps = [('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)),
('svc',
SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False))]
The _final_estimator property is called only after all the transforms have been fitted and applied one after the other, to get the estimator that is then fitted on the transformed data; see line 333 of the source for details.
So, considering steps, I can retrieve the SVC instance from its last position
final_estimator = steps[-1][1]
final_estimator
>>> SVC(C=1.0, ..., verbose=False)
and fit it to the training data
final_estimator.fit(Xt, y)
where Xt is the transformed training data (calculated before fitting the estimator) and y the training target.
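To make this concrete, here is a minimal sketch of what Pipeline.fit does with such steps (X_train and y_train are assumed placeholder names for your training data):
# Apply each transformer's fit_transform in order, then fit the final estimator
# on the transformed data.
Xt = X_train
for name, transformer in steps[:-1]:
    Xt = transformer.fit_transform(Xt, y_train)

final_estimator = steps[-1][1]
final_estimator.fit(Xt, y_train)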
I am using the 20newsgroups dataset from scikit-learn for reproducibility. When I train an SVM model and then perform data cleaning by removing headers, footers and quotes, the accuracy decreases. Isn't it supposed to improve with data cleaning? What is the point of doing all that and then getting worse accuracy?
I have created this example with data cleaning to help you understand what I am referring to:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
categories = ['alt.atheism', 'comp.graphics']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=2017,
remove=('headers', 'footers', 'quotes') )
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories,shuffle=True, random_state=2017,
remove=('headers', 'footers', 'quotes') )
y_train = newsgroups_train.target
y_test = newsgroups_test.target
vectorizer = TfidfVectorizer(sublinear_tf=True, smooth_idf = True, max_df=0.5, ngram_range=(1, 2),stop_words='english')
X_train = vectorizer.fit_transform(newsgroups_train.data)
X_test = vectorizer.transform(newsgroups_test.data)
from sklearn.svm import SVC
from sklearn import metrics
clf = SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf', max_iter=-1,
probability=False, random_state=None, shrinking=True, tol=0.001,
verbose=False)
clf = clf.fit(X_train, y_train)
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)
print('Train accuracy_score: ', metrics.accuracy_score(y_train, y_train_pred))
print('Test accuracy_score: ',metrics.accuracy_score(newsgroups_test.target, y_test_pred))
print("-"*12)
print("Train Metrics: ", metrics.classification_report(y_train, y_train_pred))
print("-"*12)
print("Test Metrics: ", metrics.classification_report(newsgroups_test.target, y_test_pred))
Results before data cleaning:
Train accuracy_score: 1.0
Test accuracy_score: 0.9731638418079096
Results after data cleaning:
Train accuracy_score: 0.9887218045112782
Test accuracy_score: 0.9209039548022598
It is not necessarily your data cleaning; I assume you ran the script twice?
The problem is this line of code:
clf = SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf', max_iter=-1,
probability=False, random_state=None, shrinking=True, tol=0.001,
verbose=False)
random_state=None. You should fix the random state, e.g. random_state=42, otherwise you cannot reproduce the same result; if you ran this code again right now, you would get a different result.
Edit:
The explanation is on the dataset site itself:
If you implement the following (note that coef_ requires a linear classifier, e.g. kernel='linear', rather than the RBF SVC above):
import numpy as np

def show_top10(classifier, vectorizer, categories):
    feature_names = np.asarray(vectorizer.get_feature_names())
    for i, category in enumerate(categories):
        top10 = np.argsort(classifier.coef_[i])[-10:]
        print("%s: %s" % (category, " ".join(feature_names[top10])))
You can now see many things that these features have overfit to:
Almost every group is distinguished by whether headers such as NNTP-Posting-Host: and Distribution: appear more or less often.
Another significant feature involves whether the sender is affiliated with a university, as indicated either by their headers or their signature.
The word “article” is a significant feature, based on how often people quote previous posts like this: “In article [article ID], [name] <[e-mail address]> wrote:”
Other features match the names and e-mail addresses of particular people who were posting at the time.
With such an abundance of clues that distinguish newsgroups, the classifiers barely have to identify topics from text at all, and they all perform at the same high level.
For this reason, the functions that load the 20 Newsgroups data provide a parameter called remove, telling it what kinds of information to strip out of each file. remove should be a tuple containing any subset of ('headers', 'footers', 'quotes').
To summarize:
The remove parameter protects you from data leakage: your training data would otherwise contain information (headers, footers, quotes) that you will not have at prediction time, so you have to remove it. Otherwise you get a better score, but only thanks to clues that will not be there for new data.
I would like to use class_weight to create a weighted SVC classifier in scikit-learn. Nevertheless, I'm not sure if I'm configuring my model correctly. Please consider the example below:
x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 0]])
y = np.array([1, 1, 0])

cw = {}
for l in set(y):
    cw[l] = np.sum(y == l)
print(cw)

m = SVC(probability=True, max_iter=1000, class_weight=cw)
m = m.fit(x, y)
I obtained the model:
SVC(C=1.0, cache_size=200, class_weight={0: 1, 1: 2}, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=1000, probability=True, random_state=None, shrinking=True,
tol=0.001, verbose=False)
With class_weight={0: 1, 1: 2} corresponding to the number of data points in each class.
QUESTION: Is it correct to proceed in this way?
As you have a 2:1 ratio of class labels, this weighting appears to be correct.
One other thing you can do, if you don't want to calculate the class weights manually, is to pass class_weight='balanced' and let SVC balance the weights for you; 'balanced' assigns weights inversely proportional to the class frequencies, as sketched below.
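For illustration, a minimal sketch of what 'balanced' computes for this toy example (compute_class_weight keyword arguments as in recent scikit-learn versions):
import numpy as np
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 0]])
y = np.array([1, 1, 0])

# 'balanced' weights are n_samples / (n_classes * np.bincount(y)) -> [1.5, 0.75]
print(compute_class_weight('balanced', classes=np.unique(y), y=y))

m = SVC(probability=True, max_iter=1000, class_weight='balanced').fit(x, y)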
I have a video games dataset with many categorical columns.
I binarized all these columns.
Now I want to predict a column (called Rating) with Logistic Regression, but this column is now actually binarized into four columns (Rating_Everyone, Rating_Everyone10+, Rating_Teen and Rating_Mature).
So I applied Logistic Regression four times; here is my code:
df2 = pd.read_csv('../MQPI/docs/Video_Games_Sales_as_at_22_Dec_2016.csv', encoding="utf-8")
y = df2['Rating_Everyone'].values
df2 = df2.drop(['Rating_Everyone'], axis=1)
df2 = df2.drop(['Rating_Everyone10'], axis=1)
df2 = df2.drop(['Rating_Teen'], axis=1)
df2 = df2.drop(['Rating_Mature'], axis=1)
X = df2.values
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.20)
log_reg = LogisticRegression(penalty='l1', dual=False, C=1.0, fit_intercept=False, intercept_scaling=1,
class_weight=None, random_state=None, solver='liblinear', max_iter=100,
multi_class='ovr',
verbose=0, warm_start=False, n_jobs=-1)
log_reg.fit(Xtrain, ytrain)
y_val_l = log_reg.predict(Xtest)
ris = accuracy_score(ytest, y_val_l)
print("Logistic Regression Rating_Everyone accuracy: ", ris)
And again:
y = df2['Rating_Everyone10'].values
df2 = df2.drop(['Rating_Everyone'], axis=1)
df2 = df2.drop(['Rating_Everyone10'], axis=1)
df2 = df2.drop(['Rating_Teen'], axis=1)
df2 = df2.drop(['Rating_Mature'], axis=1)
X = df2.values
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.20)
log_reg = LogisticRegression(penalty='l1', dual=False, C=1.0, fit_intercept=False, intercept_scaling=1,
class_weight=None, random_state=None, solver='liblinear', max_iter=100,
multi_class='ovr',
verbose=0, warm_start=False, n_jobs=-1)
log_reg.fit(Xtrain, ytrain)
y_val_l = log_reg.predict(Xtest)
ris = accuracy_score(ytest, y_val_l)
print("Logistic Regression Rating_Everyone accuracy: ", ris)
And so on for Rating_Teen and Rating_Mature.
Can you tell me how to merge these four results into one result, OR how I can handle this multiclass Logistic Regression problem better?
The LogisticRegression model can inherently handle multiclass problems:
Below is a summary of the classifiers supported by scikit-learn
grouped by strategy; you don’t need the meta-estimators in this class
if you’re using one of these, unless you want custom multiclass
behavior: Inherently multiclass: Naive Bayes, LDA and QDA, Decision
Trees, Random Forests, Nearest Neighbors, setting
multi_class='multinomial' in sklearn.linear_model.LogisticRegression.
As a basic model, without class weighting (which you may need, since the samples may not be balanced over the ratings), set multi_class='multinomial' and change the solver to 'lbfgs' or one of the other solvers that support multiclass problems:
For multiclass problems, only ‘newton-cg’, ‘sag’ and ‘lbfgs’ handle
multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes
So you don't have to split your dataset up the way you have. Instead, provide the original ratings column as the labels.
Here is a minimal example:
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.randn(10, 10)
y = np.random.randint(1, 4, size=10)  # 3 classes simulating ratings

lg = LogisticRegression(multi_class='multinomial', solver='lbfgs')
lg.fit(X, y)
lg.predict(X)
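Applied to the question's dataframe, a hypothetical sketch (it assumes df2 still contains the four mutually exclusive one-hot rating columns, named as in the question):
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rating_cols = ['Rating_Everyone', 'Rating_Everyone10', 'Rating_Teen', 'Rating_Mature']
y = df2[rating_cols].idxmax(axis=1)       # rebuild one rating label per row
X = df2.drop(rating_cols, axis=1).values

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.20, random_state=42)
log_reg = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
log_reg.fit(Xtrain, ytrain)
print(accuracy_score(ytest, log_reg.predict(Xtest)))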
Edit: responding to a comment.
tl;dr: I expect the model will learn that relationship on its own. If not, you might encode the information as a feature. So there is no obvious need to binarize your classes.
The way I understand it, you have features of a movie and the MPAA rating of the movie as the label (which you're trying to predict). This is then a multiclass problem, which you can start modeling with logistic regression (this you knew); that is the model I proposed above.
Now you recognized that there is an implicit distance between classes. The way I would use this information is as a feature for the model. However, I'd first be inclined to see if the model will learn this on its own.
For experimental purposes, I train the SVM model as follows:
clf = SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=True)
scores = cross_val_score(clf, train_feature, train_label, cv=3)
print(scores)
The printed output looks as follows:
Warning: using -h 0 may be faster
optimization finished, #iter = 2182
obj = -794.208203, rho = 1.303717
nSV = 1401, nBSV = 992
Total nSV = 1401
The cross-validation scores are:
[LibSVM][LibSVM][LibSVM][ 0.68838493 0.6887449 0.75864138]
I think nSV represents the number of support vectors. Is that right? If so, what do nBSV and rho represent? And how can I know whether these cross-validation scores are a good indicator of model performance?