How can I train a model in statsmodels?

This is a pretty straightforward question, and I know some will be inclined to give it a -1, but please let me explain.
Most statsmodels tutorials on the internet (such as this, this and this) create a Linear Regression without splitting the dataset into train and test. They create the linear regression using this syntax:
import statsmodels.formula.api as sm
sm.ols('y~x1+x2+x3', data=df).fit()
There is no need to say how dangerous it is to build a model without a test dataset.
My question here is how can I create a linear regression with statsmodels, using train and test split?
After searching a lot, I found this approach:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
features, target, train_size=0.8, random_state=42
)
import statsmodels.formula.api as smf
smfOLS = smf.OLS(X_train, y_train).fit()
However, I'm getting this error:
AttributeError: module 'statsmodels.formula.api' has no attribute 'OLS'
I know I should provide a dataset, but unfortunately I'm working with confidential data. Any dataset you have should be enough to reproduce the situation.

Try this. OLS (capitalized) lives in statsmodels.api, not statsmodels.formula.api (which has the lowercase ols formula interface), and it takes the target first, then the features:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
features, target, train_size=0.8, random_state=42
)
import statsmodels.api as sm
smfOLS = sm.OLS(y_train, X_train).fit()
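Note that sm.OLS does not add an intercept automatically; wrap the features with sm.add_constant(X_train) if you want one. If you prefer the formula interface from the tutorials, you can split the DataFrame itself and fit on the training portion. A minimal sketch, assuming df holds the columns y, x1, x2, x3 from your formula:
from sklearn.model_selection import train_test_split
import statsmodels.formula.api as smf

# split the whole DataFrame so the formula can refer to column names
df_train, df_test = train_test_split(df, train_size=0.8, random_state=42)

result = smf.ols('y ~ x1 + x2 + x3', data=df_train).fit()
print(result.summary())

# evaluate on the held-out rows
preds = result.predict(df_test)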

Related

How to build re-usable scikit-learn pipeline for Random Forest Classifier?

I am trying to understand how scikit-learn pipelines work. I have some dummy data, and I am trying to fit a Random Forest model to the iris data. Here is some code:
from sklearn import datasets
from sklearn import svm
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import joblib  # for saving/loading fitted models
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
# Load the Iris dataset
iris = datasets.load_iris()
Divide the data into train and test and create a pipeline with 2 steps:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,random_state=0)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
pipeline = Pipeline([('feature_selection', SelectKBest(chi2, k=2)), ('classification', RandomForestClassifier()) ])
print(type(pipeline))
(112, 4) (38, 4) (112,) (38,)
<class 'sklearn.pipeline.Pipeline'>
But when I execute pipeline.fit_transform(X_train, y_train), I get an error saying AttributeError: 'RandomForestClassifier' object has no attribute 'transform'.
However, pipeline.fit(X_train, y_train) works fine.
In a normal scenario, without any pipeline code, what I have usually done is take an ML model, apply fit_transform() on my training dataset, and transform() on my unseen dataset to generate predictions.
How do I set up something similar using pipelines in sklearn? I want to SAVE my pipeline and then perform scoring by LOADING it back. Can I do it using pickle?
Another thing is regarding the RF model itself. I can get a model summary using the RF model's methods, but I don't see any methods on my pipeline for printing a model summary.
But when I execute pipeline.fit_transform(X_train, y_train), I get an error saying AttributeError: 'RandomForestClassifier' object has no attribute 'transform'.
Indeed, RandomForestClassifier does not transform data because it is a model, not a transformer. Pipelines implement either transform or predict (and its variants) depending on whether the last estimator is a transformer or a model.
So, generally, you'll want to call just pipeline.fit(X_train, y_train); then in testing or production you'll call pipeline.predict(X_test) (or predict_proba, or ...), which internally transforms with the earlier step(s) and predicts with the last step.
How do I set up something similar using pipelines in sklearn? I want to SAVE my pipeline and then perform scoring by LOADING it back. Can I do it using pickle?
Yes; see sklearn Model Persistence for more details and recommendations.
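A minimal sketch with joblib (pickle works too; the filename here is just an example):
import joblib

# save the fitted pipeline to disk
joblib.dump(pipeline, 'rf_pipeline.joblib')

# later: load it back and score new data
loaded = joblib.load('rf_pipeline.joblib')
preds = loaded.predict(X_test)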
Another thing is regarding the RF model itself. I can get a model summary using the RF model's methods, but I don't see any methods on my pipeline for printing a model summary.
You can access the individual steps of a pipeline in a few ways; see sklearn Pipeline accessing steps:
pipeline.named_steps.classification
pipeline['classification']
pipeline[-1]
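For example, once the pipeline is fitted you can pull attributes off the final step; a small sketch using the step names from the pipeline above:
rf = pipeline.named_steps['classification']  # the fitted RandomForestClassifier
print(rf.feature_importances_)               # importances of the k selected features
print(rf.n_estimators)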

Is it possible to tune the linear regression (hyper)parameters in sklearn?

I'm starting to learn a bit of scikit-learn and ML in general, and I'm running into a problem.
I've created a model using linear regression.
The .score is good (above 0.8), but I want to get it higher (perhaps to 0.9).
I've searched the documentation of sklearn and googled this question but I cannot seem to find the answer.
My question is: is it possible to tune the LinearRegression model, and if so, where can I find out how?
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
#----- Forecast in hours -----#
forecast_out = 48
#----- Import and prep data -----#
# ... use pandas to create X and y ...
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
#----- Linear Regression-----#
lr = LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None)  # 'normalize' was removed in newer scikit-learn
lr.fit(x_train, y_train)
lr_confidence = lr.score(x_test, y_test)
print("lr confidence: ", lr_confidence)
x_forecast = np.array(data.drop(['Prediction'], axis=1))[-forecast_out:]
lr_prediction = lr.predict(x_forecast)
There is always room for improvement, and LinearRegression does have parameters. Use .get_params() to find out the parameter names and their default values, then use .set_params(**params) to set values from a dictionary.
GridSearchCV and RandomizedSearchCV can help you tune them better and faster than you can by hand.
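A minimal sketch of that workflow (LinearRegression exposes only a couple of fitting options, so the grid is small):
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

lr = LinearRegression()
print(lr.get_params())  # e.g. {'copy_X': True, 'fit_intercept': True, ...}

grid = GridSearchCV(lr, param_grid={'fit_intercept': [True, False]}, cv=5)
grid.fit(x_train, y_train)
print(grid.best_params_, grid.best_score_)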
This is a very open-ended question and you should just look up the documentation. It's all there, really, trust me - I've looked. Just Google LinearRegression documentation.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
It seems that sklearn.linear_model.LinearRegression has essentially no hyperparameters to tune (ordinary least squares is solved in closed form). Instead, consider sklearn.linear_model.SGDRegressor, which provides many possibilities for tuning hyperparameters.
Its documentation can be found here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html
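A sketch of what tuning SGDRegressor might look like (the grid values here are only illustrative, and SGD is sensitive to feature scale, so consider standardizing first):
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

params = {
    'penalty': ['l2', 'l1', 'elasticnet'],
    'alpha': [1e-4, 1e-3, 1e-2],
    'learning_rate': ['constant', 'optimal', 'invscaling'],
}
search = GridSearchCV(SGDRegressor(max_iter=1000), params, cv=5)
search.fit(x_train, y_train)
print(search.best_params_)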
No, it is not possible in any meaningful way.
For hyperparameter tuning of linear regression, try Lasso, Ridge, or ElasticNet instead.
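For example, Ridge adds a regularization strength alpha that you can search over; a minimal sketch with illustrative values:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(Ridge(), {'alpha': [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(x_train, y_train)
print(search.best_params_['alpha'])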

Can I use XGBoost to boost other models (eg. Naive Bayes, Random Forest)?

I am working on a fraud analytics project and I need some help with boosting. Previously, I used SAS Enterprise Miner to learn more about boosting/ensemble techniques and I learned that boosting can help to improve the performance of a model.
Currently, my group has completed the following models in Python: Naive Bayes, Random Forest, and Neural Network. We want to use XGBoost to make the F1-score better. I am not sure if this is possible, since I have only come across tutorials on how to do XGBoost or Naive Bayes on their own.
I am looking for a tutorial that shows how to create a Naive Bayes model and then apply boosting to it. After that, we can compare the metrics with and without boosting to see if it improved. I am relatively new to machine learning, so I could be wrong about this concept.
I thought of replacing values in the XGBoost parameters, but I'm not sure which ones to change or if it can even work this way.
Naive Bayes
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_sm,y_sm, test_size = 0.2, random_state=0)
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score
nb = GaussianNB()
nb.fit(X_train, y_train)
nb_pred = nb.predict(X_test)
XGBoost
from sklearn.model_selection import train_test_split
import xgboost as xgb
from xgboost import XGBClassifier
X_train, X_test, y_train, y_test = train_test_split(X_sm,y_sm, test_size = 0.2, random_state=0)
model = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                      colsample_bynode=1, colsample_bytree=0.9, gamma=0,
                      learning_rate=0.1, max_delta_step=0, max_depth=10,
                      min_child_weight=1, missing=None, n_estimators=500, n_jobs=-1,
                      nthread=None, objective='binary:logistic', random_state=0,
                      reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
                      silent=None, subsample=0.9, verbosity=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
In theory, boosting any (base) classifier is easy and straightforward with scikit-learn's AdaBoostClassifier. E.g. for a Naive Bayes classifier, it should be:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
model = AdaBoostClassifier(base_estimator=nb, n_estimators=10)  # 'base_estimator' was renamed to 'estimator' in newer scikit-learn
model.fit(X_train, y_train)
and so on.
In practice, we never use Naive Bayes or Neural Nets as base classifiers for boosting (let alone Random Forests, which are themselves an ensemble method).
Adaboost (and similar boosting methods derived afterwards, like GBM and XGBoost) was conceived using decision trees (DTs) as base classifiers (more specifically, decision stumps, i.e. DTs with a depth of only 1). There is good reason why, still today, if you don't explicitly specify the base_estimator argument in scikit-learn's AdaBoostClassifier above, it assumes a value of DecisionTreeClassifier(max_depth=1), i.e. a decision stump.
DTs are suitable for such ensembling because they are essentially unstable classifiers, which is not the case with the other algorithms mentioned, hence the latter are not expected to offer anything when used as base classifiers for boosting algorithms.
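One way to check this on your own data is to compare cross-validated F1 scores of AdaBoost over stumps versus AdaBoost over Naive Bayes; a hedged sketch, reusing the X_sm, y_sm arrays from the question (note again that base_estimator is called estimator in newer scikit-learn):
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

stump_boost = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1), n_estimators=100)
nb_boost = AdaBoostClassifier(base_estimator=GaussianNB(), n_estimators=100)

print(cross_val_score(stump_boost, X_sm, y_sm, cv=5, scoring='f1').mean())
print(cross_val_score(nb_boost, X_sm, y_sm, cv=5, scoring='f1').mean())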

XGBoost: split data in train and test

I am using the python interface for XGBoost for building models. I have a dataset that I am reading in using xgb.DMatrix(data_path). I need to split this data into train and test (and validation, if required). But most of the implementations I have seen are of the form
dtrain = xgb.DMatrix('')
dtest = xgb.DMatrix('')
I couldn't find a way to read in the dataset and then split it into train and test (and validation) sets.
Furthermore, is it possible to perform stratified sampling while splitting into train and test?
I need to know this because I have slightly larger datasets; currently I read them in using Spark, split them up, store them on disk, and then read from there. Is there a way I can do it without having to go through PySpark and read from HDFS?
I would use sklearn's train_test_split, which also has a stratify parameter, and then put the results into dtrain and dtest.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
import xgboost as xgb
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
See implementation here: A Simple XGBoost Tutorial Using the Iris Dataset.
You can always read in data from HDF5 files using pandas (see pandas.HDFStore), then do the splitting with sklearn (either a simple random or a stratified train/test split; see the stratify parameter of train_test_split). You can then feed the pandas DataFrame objects directly into the sklearn API of xgboost, or convert them into xgboost.DMatrix objects and use those in the native training API.
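A hedged sketch of that route, assuming an HDF5 file data.h5 with key 'df' and a label column named 'target' (all three names are made up for illustration):
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

df = pd.read_hdf('data.h5', key='df')  # hypothetical file and key
X = df.drop(columns=['target'])        # hypothetical label column
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)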

Why does K-fold cross-validation need a fit first?

I get an error in the following code unless I do a fit on the SVC:
This SVC instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
Unless I do this:
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
Why do I need to do a fit before doing cross-validation?
import numpy as np
from sklearn import model_selection
from sklearn import datasets
from sklearn import svm
iris = datasets.load_iris()
# Split the iris data into train/test data sets with 40% reserved for testing
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)
# Build an SVC model for predicting iris classifications using training data
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
# Now measure its performance with the test data
clf.score(X_test, y_test)
# We give cross_val_score a model, the entire data set and its "real" values, and the number of folds:
scores = model_selection.cross_val_score(clf, iris.data, iris.target, cv=5)
You don't. Your cross_val_score runs fine without the fit.
You do need to fit before running score.
The reason you are seeing that error is because you are asking your estimator (clf) to compute the accuracy of its classifications (with the clf.score method) before it actually knows how to do the classification. To teach clf how to do the classification you have to train it by calling the fit method. This is what the error message is trying to tell you.
score in the above sense has nothing to do with cross-validation; it is only accuracy. The cross_val_score helper you use can take an untrained estimator and compute a cross-validated score for your data. This helper trains the estimator for you, and that's why you don't have to call fit before using it.
See the documentation for cross-validation for more information.
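To see that no prior fit is needed, the same cross-validated scores can be computed from a freshly constructed, never-fitted estimator; a minimal sketch:
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
fresh_clf = svm.SVC(kernel='linear', C=1)  # never fitted by hand

# cross_val_score clones and fits the estimator inside each fold
scores = cross_val_score(fresh_clf, iris.data, iris.target, cv=5)
print(scores.mean())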
