I understand that scaling means centering the data (mean = 0) and scaling to unit variance (variance = 1).
But what is the difference between preprocessing.scale(x) and preprocessing.StandardScaler() in scikit-learn?
They do exactly the same thing, but:
preprocessing.scale(x) is just a function that transforms some data
preprocessing.StandardScaler() is a class supporting the Transformer API
I would always use the latter, even if I did not need inverse_transform and co., which are supported by StandardScaler().
Excerpt from the docs:
The function scale provides a quick and easy way to perform this operation on a single array-like dataset
The preprocessing module further provides a utility class StandardScaler that implements the Transformer API to compute the mean and standard deviation on a training set so as to be able to later reapply the same transformation on the testing set. This class is hence suitable for use in the early steps of a sklearn.pipeline.Pipeline
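For illustration, here is a minimal sketch (the toy arrays are made up) of how the two relate:

import numpy as np
from sklearn import preprocessing

X_train = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
X_test = np.array([[2.0, 4.0]])

# One-off function: standardizes X_train but keeps no state
X_scaled = preprocessing.scale(X_train)

# Transformer: remembers mean_ and scale_ from the training data...
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)  # identical to X_scaled

# ...so the same transformation can later be applied to the test set
X_test_scaled = scaler.transform(X_test)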
My initial understanding was that scale performs min-max scaling while StandardScaler maps data into [-1, 1], but that is not the case: both standardize to zero mean and unit variance. For scaling into a fixed range, use MinMaxScaler instead.
When working with text data, I understand the need to encode text labels into some numeric representation (e.g. by using LabelEncoder, OneHotEncoder, etc.)
However, my question is whether you need to perform this step explicitly when you're using some feature extraction class (e.g. TfidfVectorizer, CountVectorizer, etc.), or whether these will encode the labels under the hood for you.
If you do need to encode the labels separately yourself, are you able to perform this step in a Pipeline (such as the one below)?
pipeline = Pipeline(steps=[
    ('tfidf', TfidfVectorizer()),
    ('sgd', SGDClassifier())
])
Or do you need to encode the labels beforehand, since the pipeline expects to fit() and transform() the data (not the labels)?
Have a look at the scikit-learn glossary for the term transform:
In a transformer, transforms the input, usually only X, into some transformed space (conventionally notated as Xt). Output is an array or sparse matrix of length n_samples and with the number of columns fixed after fitting.
In fact, almost all transformers only transform the features. This holds true for TfidfVectorizer and CountVectorizer as well. If ever in doubt, you can always check the return type of the transforming function (like the fit_transform method of CountVectorizer).
Same goes when you assemble several transformers in a pipeline. It is stated in its user guide:
Transformers are usually combined with classifiers, regressors or other estimators to build a composite estimator. The most common tool is a Pipeline. Pipeline is often used in combination with FeatureUnion which concatenates the output of transformers into a composite feature space. TransformedTargetRegressor deals with transforming the target (i.e. log-transform y). In contrast, Pipelines only transform the observed data (X).
So in conclusion, you typically handle the labels separately and before you fit the estimator/pipeline.
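For example, a minimal sketch (the toy corpus and labels are made up) of encoding the target yourself and then fitting the pipeline, which only ever transforms X. Note that most scikit-learn classifiers also accept string targets directly, so the explicit LabelEncoder step is often optional:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

texts = ["good movie", "bad movie", "great film", "terrible film"]
labels = ["pos", "neg", "pos", "neg"]

# Encode the target separately; the pipeline never transforms y
le = LabelEncoder()
y = le.fit_transform(labels)

pipeline = Pipeline(steps=[
    ('tfidf', TfidfVectorizer()),
    ('sgd', SGDClassifier())
])
pipeline.fit(texts, y)

# Map predictions back to the original string labels if needed
pred = le.inverse_transform(pipeline.predict(["good film"]))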
Hello to all you great minds,
I'm trying to understand more rigorously the way polynomial fitting works with scikit-learn. More specifically, what I'm trying to do is break down the process and only show a dataframe with the new polynomial features generated based on a single value.
So I have data with several entries, each 1-dimensional. I want to generate a design matrix suitable for polynomial fitting. What I am currently doing is along these lines:
pd.DataFrame(PolynomialFeatures(k).fit_transform(X))
And this works as expected.
However, what I'm struggling with is the role of fit_transform(). I am not trying to fit anything quite yet, merely to produce a dataframe with the newly constructed polynomial features. Naively I tried changing fit_transform() to transform(), but apparently I have to use fit before I am allowed to transform.
I would appreciate it if anyone could point me to my error. I am not yet trying to fit a model on the data, only to create a design matrix with the polynomial features, so why do I have to use fit() (or fit_transform(), for that matter)? In fact, I don't really understand what fit() actually does here, and the documentation didn't help me wrap my head around it.
Thank you!
I think the reason for this is to be consistent with their API. When doing preprocessing you still want to "fit" to some train data and apply the same preprocessing step to the train AND the test data.
An example where it becomes clearer is standard scaling (which is a different preprocessing step). You calculate the mean and std from the train data and apply the same scaling, (X - mean) / std, to the train AND the test data (with the mean and std taken from the train data).
Therefore the two methods fit and transform are separated.
In your case of polynomial features it probably makes no sense to "fit", because no information is extracted from the train data and the step can directly be applied to the test data without knowing the train data. But including the fit in PolynomialFeatures makes it consistent with their whole API. The consistency becomes necessary when you pipe multiple preprocessing steps.
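To make the pattern concrete, here is a minimal sketch (toy data, degree chosen arbitrarily) of fitting on the training data and reusing the same transformation on the test data:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

# StandardScaler really learns something in fit (mean and std of X_train)...
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)  # uses the *training* statistics

# ...while PolynomialFeatures' fit only records the input feature layout,
# but sharing the fit/transform interface lets both live in the same pipeline
poly = PolynomialFeatures(degree=2).fit(X_train)
X_train_p = poly.transform(X_train)
X_test_p = poly.transform(X_test)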
Sklearn implements an imputer called IterativeImputer. I believe that it works by predicting missing feature values in a round-robin fashion, using an estimator.
It has an argument called sample_posterior but I can't seem to figure out when I should use it.
sample_posterior : boolean, default=False
Whether to sample from the (Gaussian) predictive posterior of the fitted estimator for each imputation. Estimator must support return_std in its predict method if set to True. Set to True if using IterativeImputer for multiple imputations.
I looked at the source code but it still wasn't clear to me. Should I use this if I have multiple features that I am going to fill using the iterative imputer, or should I use this if I plan to use the imputer multiple times, like for a training and then a validation set?
Even with multiple features, and a training and validation/test set, you don't need sample_posterior. The "multiple imputations" part of the docstring means generating more than one dataset with the missing values filled in; see e.g. wikipedia.
Normally, IterativeImputer imputes the missing values of a feature using the predictions of a model built on the other features (iteratively, round robin, etc.). If you use a model that produces not just a single prediction but an output distribution (the posterior), then you can sample from that distribution randomly, hence sample_posterior. By running it multiple times, with different random seeds, these random choices are different, and you get multiple imputed datasets.
The documentation on that isn't great, but there's a (somewhat aged) PR for an extended example, and a toy example on SO.
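As a minimal sketch (the toy array with NaNs is made up), multiple imputation with IterativeImputer could look roughly like this:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0], [5.0, 8.0]])

imputed_datasets = []
for seed in range(5):
    # sample_posterior=True draws imputed values from the predictive
    # distribution, so different seeds give different completed datasets
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    imputed_datasets.append(imp.fit_transform(X))

# Each entry is one "completed" copy of X; downstream analyses can be run
# on each and the results pooled.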
I am trying to use linear regression in combination with Python and scikit-learn to answer the question "can user session lengths be predicted given user demographic information?"
I am using linear regression because the user session lengths are in milliseconds, which is continuous. I one hot encoded all of my categorical variables including gender, country, and age range.
I am not sure how to take into account my one hot encoding, or if I even need to.
Input Data:
I tried reading here: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
I understand the main inputs are whether to calculate a fit intercept, normalize, and copy X (all boolean), and then n_jobs.
I'm not sure what factors to take into account when deciding on these inputs. I'm also concerned whether my one hot encoding of the variables makes an impact.
You can do it like this:
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

# X is a numpy array with your (categorical) features
# y is the label array
enc = OneHotEncoder(sparse=False)  # in scikit-learn >= 1.2 the argument is named sparse_output
X_transform = enc.fit_transform(X)

# apply your linear regression as you want
model = LinearRegression()
model.fit(X_transform, y)
print("Mean squared error: %.2f" % np.mean((model.predict(X_transform) - y) ** 2))
Please note that in this example I am training and testing with the same dataset! This may cause your model to overfit. You should avoid that by splitting the data or doing cross-validation.
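For example, a minimal sketch of evaluating on held-out data instead, building on the variables above:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_transform, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)
print("Test MSE: %.2f" % np.mean((model.predict(X_test) - y_test) ** 2))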
I just wanted to fit a linear regression with sklearn, which I use as a benchmark for other non-linear approaches, such as MLPRegressor, but also variations of linear regression, such as Ridge, Lasso and ElasticNet (see here for an introduction to this group: http://scikit-learn.org/stable/modules/linear_model.html).
Doing it the same way as described by @silviomoreto (which worked for all other models) actually resulted in an erroneous model for me (very high errors). This is most likely due to the so-called dummy variable trap, which occurs due to multicollinearity in the variables when you include one dummy variable per category for categorical variables -- which is exactly what OneHotEncoder does! See also the following discussion on statsexchange: https://stats.stackexchange.com/questions/224051/one-hot-vs-dummy-encoding-in-scikit-learn.
To avoid this, I wrote a simple wrapper that excludes one variable, which then acts as the default.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder

class DummyEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, n_values='auto'):
        self.n_values = n_values

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        # one-hot encode, then drop the last encoded column so that one
        # level acts as the reference (avoids the dummy variable trap)
        ohe = OneHotEncoder(sparse=False, n_values=self.n_values)
        return ohe.fit_transform(X)[:, :-1]
So building on the code from @silviomoreto, you would change the line enc = OneHotEncoder(sparse=False) to:
enc = DummyEncoder()
This solved the problem for me. Note that OneHotEncoder worked fine (and better) for all other models, such as Ridge, Lasso and ANN.
I chose this way, because I wanted to include it in my feature pipeline. But you seem to have the data already encoded. Here, you would have to drop one column per category (e.g. for male/female only include one). So if you for example used pandas.get_dummies(...), this can be done with the parameter drop_first=True.
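As a minimal sketch (the toy DataFrame and column names are made up), that pandas route looks like this:

import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "female"],
                   "country": ["US", "DE", "US"],
                   "age": [23, 35, 41]})

# one dummy column per category is dropped, so e.g. gender becomes a single column
X_encoded = pd.get_dummies(df, columns=["gender", "country"], drop_first=True)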
Last but not least, if you really need to go deeper into linear regression in Python, and not use it just as a benchmark, I would recommend statsmodels over scikit-learn (https://pypi.python.org/pypi/statsmodels), as it provides better model statistics, e.g. p-values per variable, etc.
how to prepare data for sklearn LinearRegression
OneHotEncoder should only be used on the intended columns: those with categorical variables or strings, or integers that are essentially levels rather than numeric values.
DO NOT apply OneHotEncoder to your entire dataset, including numerical variables or Booleans.
To prepare the data for sklearn LinearRegression, the numerical and categorical columns should be handled separately.
numerical columns: standardize if your model contains interactions or polynomial terms
categorical columns: apply OneHotEncoder either through sklearn or pd.get_dummies. pd.get_dummies is more flexible, while OneHotEncoder is more consistent in working with the sklearn API.
drop='first'
As of version 0.22, OneHotEncoder in sklearn has a drop option. For example, OneHotEncoder(drop='first').fit(X), which is similar to
pd.get_dummies(drop_first=True).
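As a minimal sketch (the column names are made up), both pieces of advice can be combined with a ColumnTransformer:

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["session_count"]
categorical_cols = ["gender", "country", "age_range"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(drop="first"), categorical_cols),
])

model = Pipeline([("prep", preprocess), ("ols", LinearRegression())])
# model.fit(X_df, y), where X_df is a DataFrame containing the columns above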
use regularized linear regression
If you use regularized linear regression such as Lasso, multicollinear variables will be penalized and shrunk.
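A minimal sketch, assuming the encoded feature matrix from above:

from sklearn.linear_model import LassoCV

# the L1 penalty shrinks redundant (collinear) dummy coefficients;
# the penalty strength is chosen by cross-validation
lasso = LassoCV(cv=5)
# lasso.fit(X_encoded, y)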
limitation of p-value statistics
The p-value in OLS is only valid when the OLS assumptions are more or less true. While there are methods to deal with situations when p-values cannot be trusted, one potential solution is to use cross-validation or leave-one-out for gaining confidence in the model.
I have a dataset that is working nicely in weka. It has a lot of missing values represented by '?'. Using a decision tree, I am able to deal with the missing values.
However, in scikit-learn, I see that the estimators can't be used with data containing missing values. Is there an alternative library I can use instead that would support this?
Otherwise, is there a way to get around this in scikit-learn?
The py-earth package supports missing data. It's still in development and not yet on pypi, but it's pretty usable and well tested at this point and interacts well with scikit-learn. Missingness is handled as described in this paper. It does not assume missingness-at-random, and in fact missingness is treated as potentially predictive. The important assumption is that the distribution of missingness in your training data must be the same as in whatever data you use the model with in operation.
The Earth class provided by py-earth is a regressor. To create a classifier, you need to put it in a pipeline with some other scikit-learn classifier (I usually use LogisticRegression for this). Here's an example:
from pyearth import Earth
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# X and y are some training data (numpy arrays, pandas DataFrames, or
# similar) and X may have some values that are missing (nan, None, or
# some other standard signifier of missingness)
from your_data import X, y
# Create an Earth-based classifier that accepts missing data
earth_classifier = Pipeline([('earth', Earth(allow_missing=True)),
                             ('logistic', LogisticRegression())])
# Fit on the training data
earth_classifier.fit(X, y)
The Earth model handles missingness in a nice way, and the LogisticRegression only sees the transformed data coming out of Earth.transform.
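A usage sketch (X_new is assumed to be new data from the same source, possibly containing missing values):

# predict directly on data that still contains missing values
predictions = earth_classifier.predict(X_new)
probabilities = earth_classifier.predict_proba(X_new)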
Disclaimer: I am an author of py-earth.