I experimented with the breast cancer data set from scikit-learn.
Using all features, without StandardScaler:
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()
x = cancer.data
y = cancer.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
pla = Perceptron().fit(x_train, y_train)
y_pred = pla.predict(x_test)
print(accuracy_score(y_test, y_pred))
Result 1: 0.9473684210526315
Using all features, with StandardScaler:
from sklearn.preprocessing import StandardScaler

cancer = datasets.load_breast_cancer()
x = cancer.data
y = cancer.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
sc = StandardScaler()
sc.fit(x_train)
x_train = sc.transform(x_train)
x_test = sc.transform(x_test)
pla = Perceptron().fit(x_train, y_train)
y_pred = pla.predict(x_test)
print(accuracy_score(y_test, y_pred))
Result 2: 0.9736842105263158
Using only two features, without StandardScaler:
cancer = datasets.load_breast_cancer()
x = cancer.data[:,[27,22]]
y = cancer.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
pla = Perceptron().fit(x_train, y_train)
y_pred = pla.predict(x_test)
print(accuracy_score(y_test, y_pred))
Result 3: 0.37719298245614036
Using only two features, with StandardScaler:
cancer = datasets.load_breast_cancer()
x = cancer.data[:,[27,22]]
y = cancer.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
sc = StandardScaler()
sc.fit(x_train)
x_train = sc.transform(x_train)
x_test = sc.transform(x_test)
pla = Perceptron().fit(x_train, y_train)
y_pred = pla.predict(x_test)
print(accuracy_score(y_test, y_pred))
Result 4: 0.9824561403508771
As results 1-4 show, accuracy improves much more with StandardScaler when training with fewer features.
So I am wondering: why does StandardScaler have different effects with different numbers of features?
P.S. Here are the two features I chose: columns 27 and 22.
TL;DR
Don't do feature selection unless you fully understand why you are doing it and in which way it may help your algorithm learn and generalize better. For a start, please read http://www.feat.engineering/selection.html by Max Kuhn.
Full read.
I suspect you tried to select the best feature subset and encountered a situation where an (arbitrary) subset performed better than the whole dataset. StandardScaler is out of the question here, because it is considered a standard preprocessing step for your algorithm. So your real question should be "Why does a subset of features perform better than the full dataset?"
Why is your selection arbitrary? Two reasons.
First, nobody has proven that the most linearly correlated features would improve your (or any other) algorithm. Second, the best feature subset is generally different from the subset of most correlated features.
Let's see this with code.
A feature subset giving the best accuracy
Let's do a brute-force search.
from itertools import combinations
from tqdm import tqdm

acc_bench = 0.9736842105263158  # accuracy on all scaled features (result 2)
res = {}
f = x_train.shape[1]
pcpt = Perceptron(n_jobs=-1)
for i in tqdm(range(2, 10)):
    features_list = combinations(range(f), i)
    for features in features_list:
        pcpt.fit(x_train[:, features], y_train)
        preds = pcpt.predict(x_test[:, features])
        acc = accuracy_score(y_test, preds)
        if acc > acc_bench:
            acc_bench = acc
            res["accuracy"] = acc_bench
            res["features"] = features
print(res)
{'accuracy': 1.0, 'features': (0, 15, 22)}
So you can see that features [0, 15, 22] give perfect accuracy on the validation dataset.
Do the best features have anything to do with correlation to the target?
Let's build a list of features ordered by the degree of linear correlation with the target.
import numpy as np
import pandas as pd

features = pd.DataFrame(cancer.data, columns=cancer.feature_names)
target = pd.DataFrame(cancer.target, columns=['target'])
cancer_data = pd.concat([features, target], axis=1)
features_list = np.argsort(np.abs(cancer_data.corr()['target'])[:-1].values)[::-1]
features_list
array([27, 22, 7, 20, 2, 23, 0, 3, 6, 26, 5, 25, 10, 12, 13, 21, 24,
28, 1, 17, 4, 8, 29, 15, 16, 19, 14, 9, 11, 18])
You can see that the best feature subset found by brute force has nothing to do with correlation.
Can linear correlation explain the accuracy of the Perceptron?
Let's plot the number of features taken from the list above (starting with the 2 most correlated) against the resulting accuracy.
import matplotlib.pyplot as plt

res = dict()
for i in tqdm(range(2, 10)):
    features = features_list[:i]
    pcpt.fit(x_train[:, features], y_train)
    preds = pcpt.predict(x_test[:, features])
    acc = accuracy_score(y_test, preds)
    res[i] = [acc]
pd.DataFrame(res).T.plot.bar()
plt.ylim([.9, 1])
Once again, linearly correlated features have nothing to do with Perceptron accuracy.
Conclusion.
Don't select features before running any algorithm unless you are perfectly sure what you are doing and what the effects will be. Do not mix up different selection and learning algorithms, because different algorithms have different opinions about what is important and what is not. A feature unimportant for one algorithm may become important for another. This is especially true for linear vs. nonlinear algorithms (see the sketch below).
If you want to improve the accuracy of your algorithm, do data cleaning or feature engineering instead.
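To illustrate that last point, here is a minimal sketch (my own addition, not part of the original experiment) comparing how a linear model and a nonlinear model rank the same breast cancer features. The model choices and seeds are arbitrary; the point is only that the two orderings typically disagree.

import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler

cancer = datasets.load_breast_cancer()
x = StandardScaler().fit_transform(cancer.data)
y = cancer.target

# Linear view: rank features by absolute Perceptron weight.
pla = Perceptron(random_state=42).fit(x, y)
linear_rank = np.argsort(np.abs(pla.coef_[0]))[::-1]

# Nonlinear view: rank features by Random Forest impurity importance.
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(x, y)
forest_rank = np.argsort(rf.feature_importances_)[::-1]

print("Top 5 features for the Perceptron:   ", linear_rank[:5])
print("Top 5 features for the Random Forest:", forest_rank[:5])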
Related
# split dataset into features and target variable
feature_cols = ['RIAGENDR_0', 'RIDAGEYR', 'RIDRETH3_2', 'RIDRETH3_3', 'RIDRETH3_4', 'RIDRETH3_6', 'RIDRETH3_7', 'INDFMPIR', 'DMDMARTZ_1.0', 'DMDMARTZ_2.0', 'DMDMARTZ_3.0', 'DMDMARTZ_4.0', 'DMDMARTZ_6.0', 'DMDEDUC2', 'RFXT010', 'BMXWT', 'BMXBMI', 'URXUMA', 'LBDHDD', 'LBXFER', 'LBXGH', 'LBXBPB', 'LBXBCD', 'LBXBSE', 'LBXBMN', 'URXUBA', 'URXUCD', 'URXUCO', 'URXUCS', 'URXUMO', 'URXUMN', 'URXUPB', 'URXUSB', 'URXUSN', 'URXUTL', 'URXUTU']
X = data[feature_cols] # Features
scale = StandardScaler()
X = scale.fit_transform(X)
y = data['depre_score'] # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test
clf = DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print(y_test)
print(y_pred)
confusion = metrics.confusion_matrix(y_test, y_pred)
print(confusion)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
recall_sensitivity = metrics.recall_score(y_test, y_pred, pos_label=1)
recall_specificity = metrics.recall_score(y_test, y_pred, pos_label=0)
print(recall_sensitivity, recall_specificity)
Why do you think you are doing something wrong? Perhaps your data are such that you can achieve a perfect classification... e.g., see this mushroom classification.
Having said that, it is also possible that there is some leakage in your data, as noted by @gtomer. That means an exact point that is present in the training set is also available in your test set. You can run a K-fold test on your data and see how the accuracy holds up. Secondly, try different classifiers too (it is usually better to use Random Forests compared to Decision Trees); see the sketch below.
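A minimal sketch of that check, assuming X and y are the scaled features and target from the code above (the fold scores will of course depend on your data):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Compare a single tree against a forest with 5-fold cross-validation.
for model in (DecisionTreeClassifier(random_state=1), RandomForestClassifier(random_state=1)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, round(scores.mean(), 3), "+/-", round(scores.std(), 3))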
I am encountering a very weird situation.
I am trying to use SVM in sklearn for a binary classification task. Here is my code:
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
svc = SVC(kernel='rbf', class_weight='balanced', gamma='auto',probability=True)
c_range = np.logspace(-5, 15, 11, base=2)
gamma_range = np.logspace(-9, 3, 13, base=2)
param_grid = [{'kernel': ['rbf'], 'C': c_range, 'gamma': gamma_range}]
grid = GridSearchCV(svc, param_grid, cv=5, n_jobs=-1)
clf = grid.fit(x_train, y_train)
predictions = grid.predict(x_test)
As you can see, this is a very simple model where x_train is the input for training data, x_test is the input for testing data and y_train is the label for training data.
My question is: since I didn't set any seed, why does this code always reproduce the same results? In my understanding, there should be randomness in my model and at least some variation in the results.
Let me be clear: I am not complaining that my model predicts the same class for all the test data. I am complaining that even when I set different seeds, my model still produces the same results.
For example, assume I have 3 test samples. When I set the seed to 1 or 2 or anything else, the predictions for the test data are always [1, 0, 1].
I have tried setting different seeds and changing random_state in the model. Nothing works.
My training set is very small, only a couple hundred samples. The test set is larger, with thousands of samples.
The code below will return different class probabilities for different values of random_state in SVC. The fact that the predicted classes are identical across different runs simply means that there is not much ambiguity about the classes the data points belong to. In other words, if your data points look like this, they are easily separable and models with different seeds will assign the same classes to the same points.
In practice, if a first model assigns for instance to a data point the probabilities {A: 0.942, B: 0.042, C: 0.016} and another model with a different seed assigns the probabilities {A: 0.917, B: 0.048, C: 0.035}, then both models will predict the same class A for this point.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
data = load_wine()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
x_train = scaler.fit_transform(X_train)
x_test = scaler.transform(X_test)
svc = SVC(kernel='rbf', class_weight='balanced', gamma='auto', probability=True, random_state=50)
c_range = np.logspace(-5, 15, 11, base=2)
gamma_range = np.logspace(-9, 3, 13, base=2)
param_grid = [{'kernel': ['rbf'], 'C': c_range, 'gamma': gamma_range}]
grid = GridSearchCV(svc, param_grid, cv=5, n_jobs=-1)
clf = grid.fit(x_train, y_train)
predictions = grid.predict_proba(x_test)
print(predictions)
Also, most of your data should be used for training, not for testing.
How am I supposed to implement Gaussian Naive Bayes with two separate sets (training and test)?
I need:
Create a training set by selecting the rows with id <= 160
Train a Gaussian Naive-Bayes classifier as we saw in class to determine if a campaign will be successful, given the amounts used in each marketing channel
Calculate the fraction of the training set that is correctly classified.
and:
Create a test set by selecting the rows with id > 160
Evaluate the performance of the classifier as follows:
What percentage of the test set was classified correctly (correct answers over the total)? It is desirable that this number reaches at least 80%
What is the ratio of false positives to false negatives?
Successful marketing campaign:
successful_marketing_campaign = (dataset['sales'] > 15) | (dataset['total_invested'] < 20)
And my code:
X = dataset.iloc[:, [0, 3]].values.astype('int')
y = dataset.iloc[:, [4]].values.astype('int')
X_train = dataset.iloc[0:160, [0, 3]].values.astype('int')
y_train = dataset.iloc[0:160, 4].values.astype('int')
X_test = dataset.iloc[160:, [0, 3]].values.astype('int')
y_test = dataset.iloc[160:, 4].values.astype('int')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = GaussianNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_test, y_pred)
print(matrix)
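For reference, a minimal sketch of the split I believe the assignment asks for (train on rows with id <= 160, test on rows with id > 160, keeping the same column positions as above and dropping the extra train_test_split; it assumes the rows are ordered by id):

from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import GaussianNB

# Manual split by row position: first 160 rows train, the rest test.
X_train = dataset.iloc[:160, [0, 3]].values.astype('int')
y_train = dataset.iloc[:160, 4].values.astype('int')
X_test = dataset.iloc[160:, [0, 3]].values.astype('int')
y_test = dataset.iloc[160:, 4].values.astype('int')

clf = GaussianNB().fit(X_train, y_train)
y_pred = clf.predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
print("test accuracy:", (tp + tn) / len(y_test))
print("false positives / false negatives:", fp, "/", fn)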
I need to create a FOR loop in Python that will repeat steps 1-2 1,000 times.
Split the sample randomly into training and test sets using a 632:368 ratio.
Build the model using the 63.2% training data and compute R squared on the holdout data.
I can't seem to grab the R squared for the dataset:
y = data['Amount']
xall = data
xall.drop(["No", "Amount", "Class"], axis=1, inplace=True)
for seed in range(10_00):
    X_train, X_test, y_train, y_test = train_test_split(xall, y,
                                                        test_size=0.382,
                                                        random_state=seed)
    modelall = LinearRegression()
    modelall.fit(xall, y)
    modelall = LinearRegression().fit(xall, y)
    r_sq = modelall.score(xall, y)
    print('coefficient of determination:', r_sq)
Fit the model using the TRAINING data and estimate the score using the TEST data.
Use this:
y = data['Amount']
xall = data
xall.drop(["No", "Amount", "Class"], axis=1, inplace=True)
for seed in range(100):
    X_train, X_test, y_train, y_test = train_test_split(xall, y, test_size=0.382, random_state=seed)
    modelall = LinearRegression()
    modelall.fit(X_train, y_train)
    r_sq = modelall.score(X_test, y_test)
    print('coefficient of determination:', r_sq)
You were fitting a linear model to the whole dataset (xall) on every iteration, only changing the seed of the split. Linear regression is deterministic, so it gives the same output irrespective of the seed value.
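A quick standalone sketch (on synthetic data, not the asker's dataset) showing that refitting ordinary least squares on the same data gives identical coefficients:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# OLS has a closed-form solution, so there is no randomness in the fit.
m1 = LinearRegression().fit(X, y)
m2 = LinearRegression().fit(X, y)
print(np.allclose(m1.coef_, m2.coef_))  # True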
I have a small corpus and I want to calculate the accuracy of a naive Bayes classifier using 10-fold cross-validation. How can I do it?
Your options are to either set this up yourself or use something like NLTK-Trainer since NLTK doesn't directly support cross-validation for machine learning algorithms.
I'd probably recommend just using another module to do this for you, but if you really want to write your own code, you could do something like the following.
Supposing you want 10-fold, you would have to partition your training set into 10 subsets, train on 9/10, test on the remaining 1/10, and do this for each combination of subsets (10).
Assuming your training set is in a list named training, a simple way to accomplish this would be,
num_folds = 10
subset_size = len(training) // num_folds  # integer division
for i in range(num_folds):
    testing_this_round = training[i*subset_size:][:subset_size]
    training_this_round = training[:i*subset_size] + training[(i+1)*subset_size:]
    # train using training_this_round
    # evaluate against testing_this_round
    # save accuracy
# find mean accuracy over all rounds
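To make that concrete, here is one way the placeholder steps could be filled in, assuming an NLTK naive Bayes classifier as in the question and that training is a list of (featureset, label) pairs:

import nltk

num_folds = 10
subset_size = len(training) // num_folds
accuracies = []
for i in range(num_folds):
    testing_this_round = training[i*subset_size:][:subset_size]
    training_this_round = training[:i*subset_size] + training[(i+1)*subset_size:]
    # Train an NLTK naive Bayes classifier on this round's training portion
    classifier = nltk.NaiveBayesClassifier.train(training_this_round)
    # Evaluate on the held-out portion and save the accuracy
    accuracies.append(nltk.classify.util.accuracy(classifier, testing_this_round))
# Mean accuracy over all rounds
print(sum(accuracies) / num_folds)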
Actually there is no need for the long loop iterations provided in the most upvoted answer. Also, the choice of classifier is irrelevant (it can be any classifier).
Scikit provides cross_val_score, which does all the looping under the hood.
from sklearn.model_selection import KFold, cross_val_score

k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
clf = <any classifier>
print(cross_val_score(clf, X, y, cv=k_fold, n_jobs=1))
I've used both libraries: NLTK for naive Bayes and sklearn for cross-validation, as follows:
import nltk
from sklearn.model_selection import KFold

training_set = nltk.classify.apply_features(extract_features, documents)
cv = KFold(n_splits=10, shuffle=False)
for traincv, testcv in cv.split(training_set):
    # Collect the actual fold members by index rather than slicing between endpoints
    train_fold = [training_set[i] for i in traincv]
    test_fold = [training_set[i] for i in testcv]
    classifier = nltk.NaiveBayesClassifier.train(train_fold)
    print('accuracy:', nltk.classify.util.accuracy(classifier, test_fold))
and at the end I calculated the average accuracy
Modified the second answer:
cv = KFold(n_splits=10, shuffle=True, random_state=None)
Inspired by Jared's answer, here is a version using a generator:
def k_fold_generator(X, y, k_fold):
    subset_size = len(X) // k_fold  # integer division
    for k in range(k_fold):
        X_train = X[:k * subset_size] + X[(k + 1) * subset_size:]
        X_valid = X[k * subset_size:][:subset_size]
        y_train = y[:k * subset_size] + y[(k + 1) * subset_size:]
        y_valid = y[k * subset_size:][:subset_size]
        yield X_train, y_train, X_valid, y_valid
I am assuming that your data set X has N data points (= 4 in the example) and D features (= 2 in the example). The associated N labels are stored in y.
X = [[1, 2], [3, 4], [5, 6], [7, 8]]
y = [0, 0, 1, 1]
k_fold = 2
for X_train, y_train, X_valid, y_valid in k_fold_generator(X, y, k_fold):
    # Train using X_train and y_train
    # Evaluate using X_valid and y_valid
    pass