Python Regression Variable Selection

I have a basic linear regression with 80 numerical variables (no categorical variables). The training set has 1600 rows, the test set 700.
I would like a Python package that iterates through all column combinations to find the one with the best custom score function, or an out-of-the-box score function like AIC.
OR
If that doesn't exist, what do people here use for variable selection? I know R has some packages for this, but I don't want to deal with Rpy2.
I have no preference whether the linear model requires scikit-learn, numpy, pandas, statsmodels, or something else.

I can suggest using the Least Absolute Shrinkage and Selection Operator (Lasso). I haven't used it in a situation like yours, with that many variables, though.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
I often write code to do linear regression with statsmodels like below:
import statsmodels.api as sm
# statsmodels takes the data in the constructor: endog (y) first, then exog (X)
model = sm.OLS(train_Y, sm.add_constant(train_X))
results = model.fit()
If I want to do Lasso regression, I write code like below:
from sklearn import linear_model
model = linear_model.Lasso(alpha=1.0)  # 1.0 is the default
results = model.fit(train_X, train_Y)
You have to choose an appropriate alpha (any non-negative value): the larger it is, the stronger the shrinkage and the more coefficients are driven to exactly zero.
Try this.
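Since your goal is variable selection, here is a minimal sketch of reading the selected columns off a fitted Lasso, assuming train_X is a pandas DataFrame with your 80 columns and train_Y is the target (LassoCV picks alpha by cross-validation, so you don't have to tune it by hand):
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Lasso is scale-sensitive, so standardize the features first
X_scaled = StandardScaler().fit_transform(train_X)

# LassoCV chooses the regularization strength alpha by cross-validation
lasso = LassoCV(cv=5)
lasso.fit(X_scaled, train_Y)

# columns whose coefficients were not shrunk to zero are the "selected" variables
selected = [col for col, coef in zip(train_X.columns, lasso.coef_) if abs(coef) > 1e-8]
print("alpha chosen:", lasso.alpha_)
print("selected variables:", selected)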

Related

Differences between implementations of OLS in Sk-learn and Statsmodels

I am currently running a linear regression on my time-series data set. However, depending on which python module I use, I get completely different results.
First I used Sklearn, and my model had an R^2 score of about 0.65. After that I tried statsmodels.api to get the summary of the regression, since Sklearn doesn't provide one, and I got a completely different R^2 score of 0.96.
After this, I used the linear model of statsmodels.formula.api and got another different result, this time, closer to my first result. (R^2 of 0.65)
I want to know why this happens. It seems like a mistake on my part, but I am pretty sure I am using the same data for all of the regressions (converting the data frame to numpy arrays where necessary). Can such large differences happen because of differences in how the modules are implemented?
Thank you for taking the time to read this.

Scikit learn-Classification

Is there a straightforward way to view the top features of each class? Based on tfidf?
I am using a KNeighbors classifier, SVC with a linear kernel, and MultinomialNB.
Secondly, I have been searching for a way to view documents that have not been classified correctly. I can view the confusion matrix, but I would like to see specific documents to see what features are causing the misclassification.
classifier = SVC(kernel='linear')
counts = tfidf_vectorizer.fit_transform(data['text'].values).toarray()
targets = data['class'].values
classifier.fit(counts, targets)
test_counts = tfidf_vectorizer.transform(test['text'].values).toarray()  # transform only; the vectorizer was already fitted on the training text
predictions = classifier.predict(test_counts)
EDIT: I have added the code snippet where I am only creating a tfidf vectorizer and using it to train the classifier.
Like the previous comments suggest, a more specific question would result in a better answer, but I use this package all the time so I will try and help.
I. Determining top features for classification classes in sklearn really depends on the individual tool you are using. For example, many ensemble methods (like RandomForestClassifier and GradientBoostingClassifier) come with the .feature_importances_ attribute, which scores each feature by its importance. In contrast, most linear models (like LogisticRegression or RidgeClassifier) have a regularization penalty that penalizes the size of coefficients, so the coefficient sizes are somewhat a reflection of feature importance (although you need to keep in mind the numeric scales of the individual features); they can be accessed through the .coef_ attribute of the model class.
In summary, almost all sklearn models have some method to extract the feature importances but the methods are different from model to model. Luckily the sklearn documentation is FANTASTIC so I would read up on your specific model to determine your best approach. Also, make sure to read the User Guide associated with your problem type in addition to the model specific API.
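For instance, a minimal sketch of pulling the top TF-IDF features out of a fitted linear model, assuming the classifier and tfidf_vectorizer from the snippet in your question (note that for a multiclass SVC the rows of coef_ correspond to class pairs rather than classes, so interpret them with care):
import numpy as np

# vocabulary terms, aligned with the columns of the TF-IDF matrix
feature_names = np.array(tfidf_vectorizer.get_feature_names_out())  # get_feature_names() on older sklearn

# each row of coef_ is one set of weights (one per class for e.g. LogisticRegression,
# one per class pair for a multiclass SVC with a linear kernel)
for row_index, weights in enumerate(classifier.coef_):
    top = np.argsort(weights)[-10:][::-1]  # indices of the 10 largest weights
    print(row_index, feature_names[top])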
II. There is no out of the box sklearn method to provide the mis-classified records but if you are using a pandas DataFrame (which you should) to feed the model it can be accomplished in a few lines of code like this.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
df = pd.DataFrame(data)
x = df[[<list of feature columns>]]
y = df[<target column>]
mod = RandomForestClassifier()
mod.fit(x.values, y.values)
df['predict'] = mod.predict(x.values)
incorrect = df[df['predict']!=df[<target column>]]
The resultant incorrect DataFrame will contain only records which are misclassified.
Hope this helps!

How to do linear regression using Python and Scikit learn using one hot encoding?

I am trying to use linear regression in combination with Python and scikit-learn to answer the question "can user session lengths be predicted given user demographic information?"
I am using linear regression because the user session lengths are in milliseconds, which is continuous. I one hot encoded all of my categorical variables including gender, country, and age range.
I am not sure how to take into account my one hot encoding, or if I even need to.
Input Data:
I tried reading here: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
I understand the main inputs are whether to calculate a fit intercept, whether to normalize, whether to copy X (all boolean), and then n_jobs.
I'm not sure what factors to take into account when deciding on these inputs. I'm also concerned whether my one hot encoding of the variables makes an impact.
You can do it like this:
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
# X is a numpy array with your features
# y is the label array
enc = OneHotEncoder(sparse=False)
X_transform = enc.fit_transform(X)
# apply your linear regression as you want
model = LinearRegression()
model.fit(X_transform, y)
print("Mean squared error: %.2f" % np.mean((model.predict(X_transform) - y) ** 2))
Please note that in this example I am training and testing with the same dataset! This may cause your model to overfit. You should avoid that by splitting the data or doing cross-validation.
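A minimal sketch of such a split, reusing X_transform and y from above (the 0.3 test fraction is an arbitrary choice):
from sklearn.model_selection import train_test_split

# hold out 30% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X_transform, y, test_size=0.3, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)
print("Test MSE: %.2f" % np.mean((model.predict(X_test) - y_test) ** 2))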
I just wanted to fit a linear regression with sklearn which I use as benchmark for other non-linear approaches, such as MLPRegressor, but also variations of linear regression, such as Ridge, Lasso and ElasticNet (see here for an introduction to this group: http://scikit-learn.org/stable/modules/linear_model.html).
Doing it the same way as described by @silviomoreto (which worked for all other models) actually resulted in an erroneous model for me (very high errors). This is most likely due to the so-called dummy variable trap, which occurs due to multicollinearity when you include one dummy variable per category for categorical variables -- which is exactly what OneHotEncoder does! See also the following discussion on statsexchange: https://stats.stackexchange.com/questions/224051/one-hot-vs-dummy-encoding-in-scikit-learn.
To avoid this, I wrote a simple wrapper that excludes one variable, which then acts as the default.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder

class DummyEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, n_values='auto'):
        self.n_values = n_values

    def transform(self, X):
        # one-hot encode, then drop the last column so one level acts as the default
        ohe = OneHotEncoder(sparse=False, n_values=self.n_values)
        return ohe.fit_transform(X)[:, :-1]

    def fit(self, X, y=None, **fit_params):
        return self
So building on the code of @silviomoreto, you would change the line enc = OneHotEncoder(sparse=False) to:
enc = DummyEncoder()
This solved the problem for me. Note that OneHotEncoder worked fine (and better) for all other models, such as Ridge, Lasso and ANN.
I chose this way because I wanted to include it in my feature pipeline. But you seem to have the data already encoded. Here, you would have to drop one column per category (e.g. for male/female only include one). So if you for example used pandas.get_dummies(...), this can be done with the parameter drop_first=True.
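A minimal sketch (the gender/country column names are just hypothetical examples):
import pandas as pd

# hypothetical demographic frame
df = pd.DataFrame({'gender': ['m', 'f', 'f'], 'country': ['US', 'DE', 'US']})

# drop_first=True leaves out one dummy per category, avoiding the dummy variable trap
encoded = pd.get_dummies(df, columns=['gender', 'country'], drop_first=True)
print(encoded.columns.tolist())  # ['gender_m', 'country_US']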
Last but not least, if you really need to go deeper into linear regression in Python, and not use it just as a benchmark, I would recommend statsmodels over scikit-learn (https://pypi.python.org/pypi/statsmodels), as it provides better model statistics, e.g. p-values per variable, etc.
how to prepare data for sklearn LinearRegression
OneHotEncoder should only be used on the intended columns: those with categorical variables or strings, or integers that are essentially levels rather than numeric values.
DO NOT apply OneHotEncoder to your entire dataset, including numerical variables or Booleans.
To prepare the data for sklearn LinearRegression, the numerical and categorical columns should be handled separately.
numerical columns: standardize if your model contains interactions or polynomial terms
categorical columns: apply one-hot encoding either through sklearn's OneHotEncoder or pd.get_dummies. pd.get_dummies is more flexible, while OneHotEncoder is more consistent in working with the sklearn API (see the sketch after the drop='first' note below).
drop='first'
As of version 0.22, OneHotEncoder in sklearn has a drop option. For example, OneHotEncoder(drop='first').fit(X), which is similar to
pd.get_dummies(drop_first=True).
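A minimal sketch of handling the two kinds of columns separately in one pipeline (the column names and values below are hypothetical, just to make the example runnable):
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# hypothetical data: two numerical and two categorical columns plus the target
df = pd.DataFrame({
    'age': [23, 35, 41, 29, 52, 33],
    'income': [40000, 52000, 61000, 45000, 70000, 48000],
    'gender': ['m', 'f', 'f', 'm', 'f', 'm'],
    'country': ['US', 'DE', 'US', 'FR', 'DE', 'US'],
    'session_length': [1200.0, 950.0, 1800.0, 700.0, 1500.0, 1100.0],
})

numeric_cols = ['age', 'income']
categorical_cols = ['gender', 'country']

preprocess = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),                 # standardize numerical columns
    ('cat', OneHotEncoder(drop='first'), categorical_cols),  # one-hot, dropping one level per category
])

model = Pipeline([('prep', preprocess), ('lm', LinearRegression())])
model.fit(df[numeric_cols + categorical_cols], df['session_length'])
print(model.named_steps['lm'].coef_)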
use regularized linear regression
If you use regularized linear regression such as Lasso, multicollinear variables will be penalized and shrunk.
limitation of p-value statistics
The p-values in OLS are only valid when the OLS assumptions hold at least approximately. While there are methods to deal with situations when p-values cannot be trusted, one practical alternative is to use cross-validation or leave-one-out to gain confidence in the model.
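A minimal sketch of that kind of check, assuming a feature matrix X and target y that are already encoded/prepared (cv=5 is an arbitrary choice):
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

lm = LinearRegression()

# 5-fold cross-validated R^2
print(cross_val_score(lm, X, y, cv=5, scoring='r2'))

# leave-one-out folds contain a single row, so score them with (negative) squared error instead of R^2
print(cross_val_score(lm, X, y, cv=LeaveOneOut(), scoring='neg_mean_squared_error'))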

Bagging (bootstrap) of RFE using python scikit-learn

I want to perform bagging using python scikit-learn.
I want to combine RFE(), recursive feature selection algorithm.
The step is like below.
Make 30 subsets allowing redundant selection (bagging)
Perform RFE for each data set
Get output of each classification
find top 5 features from each output
I tried the BaggingClassifier approach like below, but it took a lot of time and did not seem to work. Using RFE alone works without problems (rfe.fit()).
cf1 = LinearSVC()
rfe = RFE(estimator=cf1)
bagging = BaggingClassifier(rfe, n_estimators=30)
bagging.fit(trainx, trainy)
Also, step 4 may be difficult, because BaggingClassifier does not offer an attribute like ranking_ in RFE().
Are there other good ways to achieve those 4 steps?
Without bagging, one would access the ranking given by RFE with the following line:
rfe.ranking_
This order can be used to sort the feature names, and then take the five first features. See the documentation for sklearn RFE for an example of this attribute.
With bagging, you would want access to each of your 30 estimators. Based on the documentation for sklearn BaggingClassifier, you can have access to them with:
bagging.estimators_
So: for each estimator in bagging.estimators_, get its ranking, sort the feature names based on this ranking, and take the first five elements, as in the sketch below.
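A minimal sketch, assuming the fitted bagging object from your snippet and a list feature_names holding the column names of trainx in order (note that RFE gives rank 1 to every selected feature, so ties within rank 1 are broken arbitrarily):
import numpy as np
from collections import Counter

top5_per_estimator = []
for est in bagging.estimators_:        # each est is a fitted RFE(LinearSVC())
    order = np.argsort(est.ranking_)   # rank 1 = kept by RFE, higher ranks = eliminated earlier
    top5 = [feature_names[i] for i in order[:5]]
    top5_per_estimator.append(top5)

# how often each feature lands in a top-5 list across the 30 bootstrap fits
counts = Counter(f for top5 in top5_per_estimator for f in top5)
print(counts.most_common(5))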
Hope this helps.

Regression using PYMC3

I posted an IPython notebook here http://nbviewer.ipython.org/gist/dartdog/9008026
And I worked through both standard Statsmodels OLS and then similar with PYMC3 with the data provided via Pandas, that part works great by the way.
I can't see how to get the more standard parameters out of PyMC3. The examples seem to just use OLS to plot the base regression line. It seems that the PyMC3 model should be able to give the parameters for the regression line, in addition to the traces -- i.e. what is the highest-probability line?
Any further explanation of the interpretation of alpha, beta and sigma is welcome!
Also, how do I use the PyMC3 model to estimate a future value of y given a new x, i.e. prediction with some probability?
And lastly, PyMC3 has a newish GLM wrapper which I tried, and it seemed to get messed up (it could well be me though).
The glm submodule sets some default priors which might very well not be appropriate for every case, yours being one of them. You can change them by using the family argument, e.g.:
pm.glm.glm('y ~ x', data,
           family=pm.glm.families.Normal(priors={'sd': ('sigma', pm.Uniform.dist(0, 12000))}))
Unfortunately this isn't very well documented yet and requires some good examples.
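For context, a minimal sketch of how such a call is typically used with this (older) PyMC3 GLM API, assuming data is a DataFrame with columns x and y; pm.summary(trace) then reports the estimates for the Intercept (alpha), the x coefficient (beta) and sd (sigma):
import pymc3 as pm

with pm.Model():
    # the formula builds Intercept, x and sd random variables
    pm.glm.glm('y ~ x', data)
    step = pm.NUTS()
    trace = pm.sample(2000, step)

# posterior summaries (means, standard deviations, intervals) per parameter
pm.summary(trace)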
