Why does k-fold cross-validation need a fit first? - python

I get the following error unless I fit the SVC first:
This SVC instance is not fitted yet. Call 'fit' with appropriate
arguments before using this method.
The error goes away if I do this:
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
Why do I need to fit before doing cross-validation?
import numpy as np
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn import model_selection
from sklearn import datasets
from sklearn import svm
iris = datasets.load_iris()
# Split the iris data into train/test data sets with 40% reserved for testing
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)
# Build an SVC model for predicting iris classifications using training data
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
# Now measure its performance with the test data
clf.score(X_test, y_test)
# We give cross_val_score a model, the entire data set and its "real" values, and the number of folds:
scores = model_selection.cross_val_score(clf, iris.data, iris.target, cv=5)

You don't. Your cross_val_score runs fine without the fit.
You do need to fit before running score.

You are seeing that error because you are asking your estimator (clf) to compute the accuracy of its classifications (with the clf.score method) before it knows how to classify. To teach clf how to classify, you have to train it by calling the fit method. This is what the error message is telling you.
score in the above sense has nothing to do with cross-validation; it only measures accuracy. The cross_val_score helper function you use can take an untrained estimator and compute a cross-validated score for your data. This helper trains the estimator for you (on a fresh copy for each fold), which is why you don't have to call fit before using it.
See the documentation for cross-validation for more information.
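As a minimal sketch of the difference (assuming a recent scikit-learn where these helpers live in sklearn.model_selection), the following passes an unfitted SVC to cross_val_score and only calls fit before the separate hold-out score:

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score, train_test_split

iris = datasets.load_iris()

# An unfitted estimator: cross_val_score clones it and fits one copy per fold.
clf = svm.SVC(kernel="linear", C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print("accuracy per fold:", scores)

# score(), by contrast, needs a fitted estimator, so fit on the training split first.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0
)
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print("hold-out accuracy:", test_accuracy)
```

Because cross_val_score works on clones, clf itself is still unfitted after the cross-validation call, which is why the explicit fit is needed before score.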

Related

How can I train a model in statsmodels?

This is a pretty straightforward question and I know some will be inclined to give a -1, but please let me explain.
Most statsmodels tutorials on the internet (such as this, this and this) create a linear regression without splitting the dataset into train and test sets. They create the linear regression using this syntax:
import statsmodels.formula.api as sm
sm.ols('y~x1+x2+x3', data=df).fit()
Needless to say, building a model without a test dataset is dangerous.
My question is: how can I create a linear regression with statsmodels using a train/test split?
After searching a lot, I found this approach:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
features, target, train_size=0.8, random_state=42
)
import statsmodels.api as sm
smfOLS = smf.OLS(X_train, y_train).fit()
However, I'm getting this error:
AttributeError: module 'statsmodels.formula.api' has no attribute 'OLS'
I know I should provide a dataset, but unfortunately, I'm working with confidential data. But any dataset you have should be enough to understand the situation.
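Try this. OLS is exposed by statsmodels.api, and it takes the target (endog) first and the features (exog) second: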
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
features, target, train_size=0.8, random_state=42
)
import statsmodels.api as sm
smfOLS = sm.OLS(y_train, X_train).fit()

How to make naive bayes multinomial with TF-idf from scratch in python?

I know there is a library in python
from sklearn.naive_bayes import MultinomialNB
but I want to know how to create one from scratch, without using library classes like TfidfVectorizer and MultinomialNB.
Here is a step-by-step guide to building a simple MNB classifier with TF-IDF:
1. Import TfidfVectorizer to tokenize and weight the terms in the dataset, MultinomialNB as the classifier, and train_test_split for splitting the dataset. (All three are available in sklearn.)
2. Split the dataset into train and test sets.
3. Initialize TfidfVectorizer, then vectorize the train set with its fit_transform method.
4. Vectorize the test set with the transform method (not fit or fit_transform, so that the vocabulary and IDF weights learned from the train set are reused).
5. Initialize the classifier by calling the constructor MultinomialNB().
model = MultinomialNB() # with default hyperparameters
6. Train the classifier with the train set.
model.fit(X_train, y_train)
7. Test/validate the classifier with the test set.
y_pred = model.predict(X_test)
model.score(X_test, y_test) # mean accuracy on the test set
Those 7 steps are the basics. You can also add text preprocessing and model evaluation.
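Put together, the steps above could be sketched as follows. Note that this uses the scikit-learn classes rather than a from-scratch implementation, and the tiny corpus and its labels are invented for illustration only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy corpus, made up for illustration only.
docs = [
    "cheap pills buy now", "win money fast", "limited offer click here",
    "free prize claim today", "meeting moved to monday", "see attached report",
    "lunch tomorrow at noon", "please review my draft",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = not spam

# Step 2: split into train and test sets.
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, random_state=0, stratify=labels
)

# Steps 3 and 4: learn vocabulary and IDF weights on the train set only,
# then reuse them to transform the test set.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(docs_train)
X_test = vectorizer.transform(docs_test)

# Steps 5 and 6: create and train the classifier.
model = MultinomialNB()
model.fit(X_train, y_train)

# Step 7: predict on the test set and compute mean accuracy.
y_pred = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(y_pred, accuracy)
```

With a corpus this small the accuracy figure is not meaningful; the point is only the order of the calls.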

How to build re-usable scikit-learn pipeline for Random Forest Classifier?

I am trying to understand how scikit-learn pipelines work. I have some dummy data and I am trying to fit a Random Forest model to iris data. Here is some code
from sklearn import datasets
from sklearn import svm
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import sklearn.externals
import joblib
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
# Load the Iris dataset
iris = datasets.load_iris()
Divide data into train and test and create a pipeline with 2 steps
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,random_state=0)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
pipeline = Pipeline([('feature_selection', SelectKBest(chi2, k=2)), ('classification', RandomForestClassifier()) ])
print(type(pipeline))
(112, 4) (38, 4) (112,) (38,)
<class 'sklearn.pipeline.Pipeline'>
But when I execute pipeline.fit_transform(X_train, y_train), I get an error saying AttributeError: 'RandomForestClassifier' object has no attribute 'transform'
However, pipeline.fit(X_train, y_train) works fine.
Normally, without any pipeline code, I take an ML model, apply fit_transform() to my training dataset, and apply transform to my unseen dataset to generate predictions.
How do I set up something similar using pipelines in sklearn? I want to SAVE my pipeline and then perform scoring by LOADING it back. Can I do that using pickle?
Another thing is regarding the RF model itself. I can get a model summary using the RF model's methods, but I don't see any methods on my pipeline that let me print a model summary.
But when I execute pipeline.fit_transform(X_train, y_train), I get an error saying AttributeError: 'RandomForestClassifier' object has no attribute 'transform'
Indeed, RandomForestClassifier does not transform data because it is a model, not a transformer. Pipelines implement either transform or predict (and its variants) depending on whether the last estimator is a transformer or a model.
So, generally, you'll want to call just pipeline.fit(X_train, y_train), then in testing or production you'll call pipeline.predict(X_test) (or predict_proba, or ...), which internally will transform with the first step(s) and predict with the last step. To get an accuracy figure, call pipeline.score(X_test, y_test).
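A minimal sketch of that workflow, reusing the pipeline from the question (the random_state on the forest is an addition here, only to make the run repeatable):

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)

pipeline = Pipeline([
    ("feature_selection", SelectKBest(chi2, k=2)),
    ("classification", RandomForestClassifier(random_state=0)),
])

# fit() runs fit_transform on the selector, then fit on the classifier.
pipeline.fit(X_train, y_train)

# predict()/predict_proba() run transform on the selector, then predict on the classifier.
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)
accuracy = pipeline.score(X_test, y_test)
print(y_pred.shape, y_proba.shape, accuracy)
```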
How do I set up something similar using pipelines in sklearn? I want to SAVE my pipeline and then perform scoring by LOADING it back. Can I do that using pickle?
Yes; see sklearn Model Persistence for more details and recommendations.
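As a rough sketch, a fitted pipeline can be round-tripped through pickle like any other Python object. The example below serializes to bytes in memory to stay self-contained; writing to a file with pickle.dump, or using joblib.dump and joblib.load, follows the same pattern. The loading side should run the same scikit-learn version that saved the object:

```python
import pickle

import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)

pipeline = Pipeline([
    ("feature_selection", SelectKBest(chi2, k=2)),
    ("classification", RandomForestClassifier(random_state=0)),
]).fit(X_train, y_train)

# SAVE: serialize the whole fitted pipeline (selector and classifier together).
saved_bytes = pickle.dumps(pipeline)

# LOAD: restore it later and score new data without refitting.
loaded_pipeline = pickle.loads(saved_bytes)
same_predictions = np.array_equal(
    pipeline.predict(X_test), loaded_pipeline.predict(X_test)
)
print("predictions identical after reload:", same_predictions)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.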
Another thing is regarding the RF model itself. I can get a model summary using the RF model's methods, but I don't see any methods on my pipeline that let me print a model summary.
You can access individual steps of a pipeline in a few ways; see sklearn Pipeline accessing steps:
pipeline.named_steps.classification
pipeline['classification']
pipeline[-1]
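For instance, once the pipeline is fitted, the fitted forest and selector can be pulled out and inspected with their own attributes. This is a sketch assuming a scikit-learn version that supports indexing a Pipeline (0.21 or later); which attributes count as a "model summary" is a judgment call, and feature_importances_ and get_support() are used here as examples:

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline

iris = datasets.load_iris()

pipeline = Pipeline([
    ("feature_selection", SelectKBest(chi2, k=2)),
    ("classification", RandomForestClassifier(random_state=0)),
]).fit(iris.data, iris.target)

# All three expressions return the same fitted RandomForestClassifier.
forest = pipeline.named_steps["classification"]
assert forest is pipeline["classification"]
assert forest is pipeline[-1]

# Which of the four iris features the selector kept (boolean mask).
selected_mask = pipeline.named_steps["feature_selection"].get_support()

# Importances refer to the two selected features, not the original four.
importances = forest.feature_importances_
print(selected_mask, importances)
```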

Can I use XGBoost to boost other models (eg. Naive Bayes, Random Forest)?

I am working on a fraud analytics project and I need some help with boosting. Previously, I used SAS Enterprise Miner to learn more about boosting/ensemble techniques and I learned that boosting can help to improve the performance of a model.
Currently, my group has completed the following models in Python: Naive Bayes, Random Forest, and Neural Network. We want to use XGBoost to improve the F1-score. I am not sure if this is possible, since I have only come across tutorials on how to do XGBoost or Naive Bayes on its own.
I am looking for a tutorial that shows how to create a Naive Bayes model and then apply boosting. After that, we can compare the metrics with and without boosting to see if it improved. I am relatively new to machine learning, so I could be wrong about this concept.
I thought of replacing the values in the XGBoost call, but I am not sure which one to change or whether it can even work this way.
Naive Bayes
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.2, random_state=0)
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score
nb = GaussianNB()
nb.fit(X_train, y_train)
nb_pred = nb.predict(X_test)
XGBoost
from sklearn.model_selection import train_test_split
import xgboost as xgb
from xgboost import XGBClassifier
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.2, random_state=0)
model = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.9, gamma=0,
learning_rate=0.1, max_delta_step=0, max_depth=10,
min_child_weight=1, missing=None, n_estimators=500, n_jobs=-1,
nthread=None, objective='binary:logistic', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=None, subsample=0.9, verbosity=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
In theory, boosting any (base) classifier is easy and straightforward with scikit-learn's AdaBoostClassifier. E.g. for a Naive Bayes classifier, it should be:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
# scikit-learn >= 1.2 calls this parameter `estimator`; older releases used `base_estimator`
model = AdaBoostClassifier(estimator=nb, n_estimators=10)
model.fit(X_train, y_train)
and so on.
In practice, we never use Naive Bayes or Neural Nets as base classifiers for boosting (let alone Random Forests, which are themselves an ensemble method).
AdaBoost (and the similar boosting methods derived from it afterwards, like GBM and XGBoost) was conceived using decision trees (DTs) as base classifiers (more specifically, decision stumps, i.e. DTs with a depth of only 1). There is good reason why, still today, if you don't explicitly specify the estimator argument (named base_estimator before scikit-learn 1.2) in scikit-learn's AdaBoostClassifier above, it defaults to DecisionTreeClassifier(max_depth=1), i.e. a decision stump.
DTs are suitable for such ensembling because they are essentially unstable classifiers, which is not the case with the other algorithms mentioned, hence the latter are not expected to offer anything when used as base classifiers for boosting algorithms.
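To check this on your own data, one option is to compare the F1-scores of plain Naive Bayes, boosted Naive Bayes, and the default boosted stumps side by side. The sketch below uses a synthetic dataset from make_classification as a stand-in (the question's X_sm and y_sm are not available), and it passes the base classifier positionally so that it works whether the parameter is called estimator or base_estimator:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

candidates = {
    "plain Naive Bayes": GaussianNB(),
    "AdaBoost + Naive Bayes": AdaBoostClassifier(
        GaussianNB(), n_estimators=10, random_state=0
    ),
    "AdaBoost + decision stumps (default)": AdaBoostClassifier(
        n_estimators=10, random_state=0
    ),
}

# Fit each candidate on the same split and record its test F1-score.
f1_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    f1_scores[name] = f1_score(y_test, model.predict(X_test))
    print(f"{name}: F1 = {f1_scores[name]:.3f}")
```

The numbers depend on the dataset, so treat the comparison as an experiment to run rather than a result to expect.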

KNeighborsClassifier .predict() function doesn't work

I am working with the KNeighborsClassifier algorithm from the scikit-learn library in Python. I followed the basic instructions, i.e. I split my data and labels into training and test sets, then trained my model on the training data. Now I am trying to compute the accuracy on the test data but I get an error. Here is my code:
from sklearn.neighbors import KNeighborsClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data_train, data_test, label_train, label_test = train_test_split(df, labels,
                                                                  test_size=0.2,
                                                                  random_state=7)
mod = KNeighborsClassifier(n_neighbors=4)
mod.fit(data_train, label_train)
predictions = mod.predict(data_test)
print(accuracy_score(label_train, predictions))  # this is the line that raises the error
The error I get:
ValueError: Found arrays with inconsistent numbers of samples: [140 558]
140 is the size of the test set and 558 is the size of the training set, given test_size=0.2 (my data set has 698 samples). I verified that the labels and the data both have 698 rows. However, I get this error, which seems to come from comparing the test set against the training set.
Does anyone know what is wrong here? What should I use to train my model, and what should I use to compute the score?
Thanks!
You should calculate the accuracy_score with label_test, not label_train. You want to compare the actual labels of the test set, label_test, to the predictions from your model, predictions, for the test set.
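A corrected version could look like the sketch below. Since the original df and labels are not available, it substitutes a synthetic dataset of the same size (698 samples) generated with make_classification:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the question's df and labels (698 samples).
X, y = make_classification(n_samples=698, n_features=10, random_state=7)

data_train, data_test, label_train, label_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)

mod = KNeighborsClassifier(n_neighbors=4)
mod.fit(data_train, label_train)

# Predictions are made for the test set, so compare them with the test labels.
predictions = mod.predict(data_test)
test_accuracy = accuracy_score(label_test, predictions)

# If training accuracy is wanted, predict on the training data as well.
train_accuracy = accuracy_score(label_train, mod.predict(data_train))
print(len(label_train), len(label_test), test_accuracy, train_accuracy)
```

Both arguments to accuracy_score must have the same length, which is why the true labels and the predictions have to come from the same split.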
Did you try to solve your issue using the following question?
sklearn: Found arrays with inconsistent numbers of samples when calling LinearRegression.fit()
