Differences between implementations of OLS in Sk-learn and Statsmodels

I am currently running a linear regression on my time-series data set. However, depending on which Python module I use, I get completely different results.
First I used Sklearn, and my model had an R^2 score of about 0.65. After that I tried statsmodels.api to get a summary of the regression, since Sklearn doesn't provide one, and I got a completely different R^2 score of 0.96.
After this, I used the linear model from statsmodels.formula.api and got yet another result, this time close to my first one (R^2 of 0.65).
I want to know why this happens. It seems like a mistake on my part, but I am pretty sure I am using the same data for all of the regressions (converting the data frame to np.arrays where necessary). Can such large differences come from differences in how the modules are implemented?
Thank you for taking the time to read this.

Related

Why do I need to call fit() before transform() when using PolynomialFeatures?

Hello to all you great minds,
I'm trying to understand more rigorously how polynomial fitting works in scikit-learn. More specifically, I'm trying to break down the process and show only a dataframe with the new polynomial features generated from a single input value.
So I have data with several entries, each of which is 1-dimensional. I want to generate a design matrix suitable for polynomial fitting. What I am currently doing is along these lines:
pd.DataFrame(PolynomialFeatures(k).fit_transform(X))
And this works as expected.
However, what I'm struggling with is the role of fit_transform(). As far as I am concerned, I am not trying to fit anything quite yet, merely to produce a dataframe with the newly constructed polynomial features. Naively, I tried changing fit_transform() to transform(), but apparently I have to call fit() before I am allowed to transform.
I would appreciate it if anyone could point me to my error. I am not yet trying to fit a model on the data, only to create a design matrix with the polynomial features, so why do I have to use fit() (or fit_transform(), for that matter)? In fact, I don't really understand what fit() actually does here, and the documentation didn't help me wrap my head around it.
Thank you!

I think the reason for this is consistency with the scikit-learn API. When preprocessing, you still want to "fit" on some training data and then apply the same preprocessing step to the training AND the test data.
An example where this becomes clearer is standard scaling (a different preprocessing step, StandardScaler in scikit-learn). You calculate the mean and std from the training data and apply the same scaling (X - mean) / std to the training AND test data (with the mean and std taken from the training data).
Therefore the two methods fit and transform are separated.
In your case of polynomial features, "fitting" does very little, because no statistics are extracted from the training data (only the number of input columns is needed) and the step can be applied directly to the test data without knowing the training data. But including fit in PolynomialFeatures makes it consistent with the rest of the API. That consistency becomes necessary when you chain multiple preprocessing steps in a pipeline.
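
To make the fit/transform split concrete, here is a small sketch with toy arrays (X_train and X_test are made-up names, not from the question). StandardScaler learns the mean and std in fit(), while PolynomialFeatures learns nothing data-dependent but still follows the same two-step interface:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Toy 1-dimensional data (illustrative only)
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

# fit() stores the training mean and std; transform() reuses them on new data
scaler = StandardScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)

# fit() here mainly records the number of input columns;
# transform() then builds the columns [1, x, x^2, x^3]
poly = PolynomialFeatures(degree=3).fit(X_train)
X_test_poly = poly.transform(X_test)

print(scaler.mean_, X_test_scaled, X_test_poly)
```

For a one-off design matrix, fit_transform(X) as in the question is just the shorthand for these two calls on the same data.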

Using K-Means with predefined centers?

I'm running a KNN classifier whose feature vectors come from a K-Means clustering step (more specifically, sklearn.cluster.MiniBatchKMeans). Since K-means starts with random points every time, I'm getting different results every time I run my algorithm. I've stored the cluster centers in a separate .npy file from a run where the results were good, but now I need to use those centers in my K-means and I don't know how.
Following this advice, I tried to use the cluster centers as starting points like so:
MiniBatchKMeans.__init__(self, n_clusters=self.clusters, n_init=1, init=np.load('cluster_centers.npy'))
Still, results change every time the algorithm is run.
Then I tried to manually alter the cluster centers after fitting the data:
kMeansInstance.cluster_centers_ = np.load('cluster_centers.npy')
Still, different results each time.
The only other solution I can think of is manually implementing the predict method using the centers I saved, but I don't know how, and I don't know if there is a better way to solve my problem than reinventing the wheel.

I would guess fixing the random_state will do the job.
See the API documentation.

Mini batch k-means only considers a sample of the data.
It uses a random generator for this.
If you want deterministic behaviour, fix the random seed, and prefer algorithms that do not use a random sample (i.e., use the regular k-means instead of mini-batch k-means).
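
A rough sketch of both suggestions, using regular KMeans with a fixed random_state and saved centers as the initial points. The data and the saved_centers array are made up here and stand in for np.load('cluster_centers.npy'). The last lines show the "manual predict" idea from the question: assign each point to its nearest saved center without refitting.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])

# Hypothetical stand-in for np.load('cluster_centers.npy')
saved_centers = np.array([[0.0, 0.0], [5.0, 5.0]])

def fit_labels():
    # Explicit initial centers, a single init, and a fixed seed
    km = KMeans(n_clusters=2, init=saved_centers, n_init=1, random_state=0)
    return km.fit(X).labels_

labels_a = fit_labels()
labels_b = fit_labels()

# Nearest-center assignment using only the saved centers (no fitting at all)
dists = np.linalg.norm(X[:, None, :] - saved_centers[None, :, :], axis=2)
nearest = dists.argmin(axis=1)
```

Note that fit() still moves the centers away from the saved values, so if the goal is to reuse the old centers exactly, the nearest-center assignment is the part that guarantees it.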

Distributed Lag Model in Python

I have quickly looked for a distributed lag model in StatsModels but can't find one. The closest thing is the VAR model. Can I transform a VAR model into a distributed lag model, and if so, how? It would be great if there are other packages that already have a distributed lag model. Please let me know if so.
Thanks!

If you are using a finite distributed lag model, just use OLS or FGLS, with the lagged predictors forming the covariate matrix, and some parameterized model of autocorrelation (if using FGLS).
If your target variable is vector-valued, then the same advice applies and it just becomes a multiple regression problem, with a separate regression for each component of the output, and possibly additional covariance structure if there is correlation between error terms across components of the target.
It does not appear there is a standard statistics package in Python that implements this directly, likely because it would boil down to FGLS in almost any practical situation.

Python Regression Variable Selection

I have a basic linear regression with 80 numerical variables (no classification variables). Training set has 1600 rows, testing 700.
I would like a Python package that iterates through all column combinations to find the best model according to a custom score function or an out-of-the-box score function like AIC.
OR
If that doesn't exist, what do people here use for variable selection? I know R has some packages like this, but I don't want to deal with Rpy2.
I have no preference if the LM requires scikit learn, numpy, pandas, statsmodels, or other.

I can suggest the Least Absolute Shrinkage and Selection Operator (Lasso). I haven't used it in a situation like yours, where you have to deal with so much data.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
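I often write code like the following to do linear regression with statsmodels:
import statsmodels.api as sm
# OLS takes the response first, then the design matrix; add_constant adds the intercept column
model = sm.OLS(train_Y, sm.add_constant(train_X))
results = model.fit()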
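If I want to do Lasso regression, I write code like this:
from sklearn import linear_model
model = linear_model.Lasso(alpha=1.0)  # alpha=1.0 is the default
results = model.fit(train_X, train_Y)
You have to choose an appropriate alpha. It can be any non-negative value, not only one between 0.0 and 1.0: larger values penalize the coefficients more and set more of them exactly to zero, while values near 0.0 approach ordinary least squares. How large to make it depends on how much fitting error you are willing to accept in exchange for a sparser model.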
Try this.
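
One way to pick the penalty strength without guessing is cross-validation. The sketch below uses sklearn's LassoCV on synthetic data in which only the first three columns matter. The data, sizes and names are made up for illustration, and the columns are standardized first because the Lasso penalty depends on feature scale.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem: 20 columns, only the first 3 are relevant
rng = np.random.default_rng(0)
n_samples, n_features = 400, 20
X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

# Standardize, then let 5-fold cross-validation choose alpha
X_scaled = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=5, random_state=0).fit(X_scaled, y)

# Columns with a non-zero coefficient are the selected variables
selected = np.flatnonzero(lasso.coef_)
print(lasso.alpha_, selected)
```

The selected set may still contain a few irrelevant columns, so it is best treated as a shortlist rather than a final model.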

Quickest linear regression implementation in python

I'm performing a stepwise model selection, progressively dropping variables with a variance inflation factor over a certain threshold.
In order to do this, I'm running OLS many, many times on datasets ranging from a few hundred MB to 10 gigs.
What would be the quickest implementation of OLS for larger datasets? The statsmodels OLS implementation seems to be using numpy to invert matrices. Would a gradient-descent-based method be quicker? Does scikit-learn have an especially quick implementation?
Or maybe an MCMC-based approach using PyMC is quickest...
Update 1: It seems that the scikit-learn implementation of LinearRegression is a wrapper for the scipy implementation.
Update 2: Scipy OLS via scikit-learn LinearRegression is twice as fast as statsmodels OLS in my very limited tests...

The scikit-learn SGDRegressor class is (iirc) the fastest, but would probably be more difficult to tune than a simple LinearRegression.
I would give each of those a try, and see if they meet your needs. I also recommend subsampling your data: if you have many gigs but they are all samples from the same distribution, you can train/tune your model on a few thousand samples (depending on the number of features). This should lead to faster exploration of your model space, without wasting a bunch of time on "repeat/uninteresting" data.
Once you find a few candidate models, then you can try those on the whole dataset.
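
A small sketch of the subsampling idea, fitting both estimators on a random subset. The array sizes, coefficients and names are invented for illustration, and no timing claims are made here. SGDRegressor is given standardized inputs because it is sensitive to feature scale.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic "large" dataset (illustrative only)
rng = np.random.default_rng(0)
true_coef = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
X_full = rng.normal(size=(100_000, 5))
y_full = X_full @ true_coef + rng.normal(scale=0.1, size=100_000)

# Explore models on a few thousand rows drawn without replacement
idx = rng.choice(len(X_full), size=5_000, replace=False)
X_sub, y_sub = X_full[idx], y_full[idx]

ols = LinearRegression().fit(X_sub, y_sub)

# Coefficients below are on the standardized scale
X_sub_scaled = StandardScaler().fit_transform(X_sub)
sgd = SGDRegressor(random_state=0).fit(X_sub_scaled, y_sub)

print(ols.coef_, sgd.coef_)
```

Which estimator is actually faster on the full data depends on its shape and the hardware, so it is worth timing both on your own data.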

Stepwise methods are not a good way to perform model selection, as they are entirely ad hoc and depend highly on which direction you run the stepwise procedure. It's far better to use criterion-based methods, or some other method for generating model probabilities. Perhaps the best approach is to use reversible-jump MCMC (rjMCMC), which fits models over the entire model space, and not just the parameter space of a particular model.
PyMC does not implement rjMCMC itself, but it can be implemented. Note also that PyMC 3 makes it really easy to fit regression models using its new glm submodule.
