python classification without having to impute missing values - python

I have a dataset that is working nicely in weka. It has a lot of missing values represented by '?'. Using a decision tree, I am able to deal with the missing values.
However, on sci-kit learn, I see that the estimators can't used with data with missing values. Is there an alternative library I can use instead that would support this?
Otherwise, is there a way to get around this in sci-kit learn?

The py-earth package supports missing data. It's still in development and not yet on pypi, but it's pretty usable and well tested at this point and interacts well with scikit-learn. Missingness is handled as described in this paper. It does not assume missingness-at-random, and in fact missingness is treated as potentially predictive. The important assumption is that the distribution of missingness in your training data must be the same as in whatever data you use the model with in operation.
The Earth class provided by py-earth is a regressor. To create a classifier, you need to put it in a pipeline with some other scikit-learn classifier (I usually use LogisticRegression for this). Here's an example:
from pyearth import Earth
from sklearn.linear_model.logistic import LogisticRegression
from sklearn.pipeline import Pipeline
# X and y are some training data (numpy arrays, pandas DataFrames, or
# similar) and X may have some values that are missing (nan, None, or
# some other standard signifier of missingness)
from your_data import X, y
# Create an Earth based classifer that accepts missing data
earth_classifier = Pipeline([('earth', Earth(allow_missing=True)),
('logistic', LogisticRegression())])
# Fit on the training data
earth_classifier.fit(X, y)
The Earth model handles missingness in a nice way, and the LogisticRegression only sees the transformed data coming out of Earth.transform.
Disclaimer: I am an author of py-earth.

Related

IterativeImputer - sample_posterior

Sklearn implements an imputer called the IterativeImputer. I believe that it works by predicting the values for missing features values in a round robin fashion, using an estimator.
It has an argument called sample_posterior but I can't seem to figure out when I should use it.
sample_posterior boolean, default=False
Whether to sample from the (Gaussian) predictive posterior of the fitted estimator for each imputation. Estimator must support
return_std in its predict method if set to True. Set to True if using
IterativeImputer for multiple imputations.
I looked at the source code but it still wasn't clear to me. Should I use this if I have multiple features that I am going to fill using the iterative imputer or should I use this if I plan to use the imputer multiple times like for a training and then validation set?
Even with multiple features, and a training and validation/test set, you don't need sample_posterior. The "multiple imputations" part of the docstring means generating more than one missings-replaced dataset; see e.g. wikipedia.
Normally, IterativeImputer imputes the missing values of a feature using the predictions of a model built on the other features (iteratively, round robin, etc.). If you use a model that produces not just a single prediction but an output distribution (the posterior), then you can sample from that distribution randomly, hence sample_posterior. By running it multiple times, with different random seeds, these random choices are different, and you get multiple imputed datasets.
The documentation on that isn't great, but there's a (somewhat aged) PR for an extended example, and a toy example on SO.

Scikit-learn: preprocessing.scale() vs preprocessing.StandardScalar()

I understand that scaling means centering the mean(mean=0) and making unit variance(variance=1).
But, What is the difference between preprocessing.scale(x)and preprocessing.StandardScalar() in scikit-learn?
Those are doing exactly the same, but:
preprocessing.scale(x) is just a function, which transforms some data
preprocessing.StandardScaler() is a class supporting the Transformer API
I would always use the latter, even if i would not need inverse_transform and co. supported by StandardScaler().
Excerpt from the docs:
The function scale provides a quick and easy way to perform this operation on a single array-like dataset
The preprocessing module further provides a utility class StandardScaler that implements the Transformer API to compute the mean and standard deviation on a training set so as to be able to later reapply the same transformation on the testing set. This class is hence suitable for use in the early steps of a sklearn.pipeline.Pipeline
My understanding is that scale will transform data in min-max range of the data, while standardscaler will transform data in range of [-1, 1].

Scikit learn-Classification

Is there a straightforward way to view the top features of each class? Based on tfidf?
I am using KNeighbors classifer, SVC-Linear, MultinomialNB.
Secondly, I have been searching for a way to view documents that have not been classified correctly? I can view the confusion matrix but I would like to see specific documents to see what features are causing the misclassification.
classifier = SVC(kernel='linear')
counts = tfidf_vectorizer.fit_transform(data['text'].values).toarray()
targets = data['class'].values
classifier.fit(counts, targets)
counts = tfidf_vectorizer.fit_transform(test['text'].values).toarray()
predictions = classifier.predict(counts)
EDIT: I have added the code snippet where I am only creating a tfidf vectorizer and using it to traing the classifier.
Like the previous comments suggest, a more specific question would result in a better answer, but I use this package all the time so I will try and help.
I. Determining top features for classification classes in sklearn really depends on the individual tool you are using. For example, many ensemble methods (like RandomForestClassifier and GradientBoostingClassifer) come with the .feature_importances_ attribute which will score each feature based on its importance. In contrast, most linear models (like LogisticRegression or RidgeClassifier) have a regularization penalty which penalizes for the size of coefficients, meaning that the coefficient sizes are somewhat a reflection of feature importance (although you need to keep in mind the numeric scales of individual features) which can be accessed using the .coef_ attribute of the model class.
In summary, almost all sklearn models have some method to extract the feature importances but the methods are different from model to model. Luckily the sklearn documentation is FANTASTIC so I would read up on your specific model to determine your best approach. Also, make sure to read the User Guide associated with your problem type in addition to the model specific API.
II. There is no out of the box sklearn method to provide the mis-classified records but if you are using a pandas DataFrame (which you should) to feed the model it can be accomplished in a few lines of code like this.
import pandas as pd
from sklearn.linear_model import RandomForestClassifier
df = pd.DataFrame(data)
x = df[[<list of feature columns>]]
y = df[<target column>]
mod = RandomForestClassifier()
mod.fit(x.values, y.values)
df['predict'] = mod.predict(x.values)
incorrect = df[df['predict']!=df[<target column>]]
The resultant incorrect DataFrame will contain only records which are misclassified.
Hope this helps!

scikit-learn to learn and generate list of numbers

I have a large data of n-hundred-dimensional list of triplets consisting of numbers, mostly integers.
[(50,100,0.5),(20,35,1.0),.....]
[(70,80,0.3),(30,45,2.0),......]
....
I'm looking at sklearn to write a simple generative model that learns the patterns from these data, and generate a likely list of triplets, but my background is rather weak, without which the documentation is rather difficult to follow.
Is there an example sklearn code that does the similar job where I can take a look at?
I agree that this question is probably more appropriate for the data science or statistics sites, but I'll take a stab at it.
First, I'll assume that your data is in a pandas dataframe; this is convenient for scikit-learn as well as other Python packages.
I would first visualize the data. Since you only have three dimensions, a three-dimensional scatter plot might be useful. For instance, see here.
Another useful way to plot the data is to use pair plots. The seaborn package makes this very easy. See here. Pair plots are useful because they show distributions of each of the variables/features, as well as correlations between pairs of features.
At this point, creating a generative model depends on what the plots tell you. If, for instance, all of the variables are independent of one another, then you simply need to estimate the pdf for each variable independently (for instance, using kernel density estimation, which is also implemented in seaborn), and then generate new samples by drawing values from each of the three distributions separately and combining these values in a single tuple.
If the variables are not independent, then the task becomes more complicated, and probably warrants a separate post on the statistics site. For instance, your samples could be generated from different clusters, possibly overlapping, in which case something like a mixture model might be useful.
Here is a small code example that does exactly that (discriminative model):
import numpy as np
from sklearn.linear_model import LinearRegression
#generate random numpy array of the size 10,3
X_train = np.random.random((10,3))
y_train = np.random.random((10,3))
X_test = np.random.random((10,3))
#define the regression
clf = LinearRegression()
#fit & predict (predict returns numpy array of the same dimensions)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
Otherwise here are more examples:
http://scikit-learn.org/stable/auto_examples/index.html
The generative model would be sklearn.mixture.GaussianMixture (works only in version 0.18)

How to do linear regression using Python and Scikit learn using one hot encoding?

I am trying to use linear regression in combination with python and scikitlearn to answer the question "can user session lengths be predicted given user demographic information?"
I am using linear regression because the user session lengths are in milliseconds, which is continuous. I one hot encoded all of my categorical variables including gender, country, and age range.
I am not sure how to take into account my one hot encoding, or if I even need to.
Input Data:
I tried reading here: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
I understand the inputs is my main are whether to calculate a fit intercept, normalize, copy x (all boolean), and then n jobs.
I'm not sure what factors to take into account when deciding on these inputs. I'm also concerned whether my one hot encoding of the variables makes an impact.
You can do like:
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
# X is a numpy array with your features
# y is the label array
enc = OneHotEncoder(sparse=False)
X_transform = enc.fit_transform(X)
# apply your linear regression as you want
model = LinearRegression()
model.fit(X_transform, y)
print("Mean squared error: %.2f" % np.mean((model.predict(X_transform) - y) ** 2))
Please note that this example I am training and testing with the same dataset! This may cause an overfit in your model. You should avoid that splitting the data or doing cross-validation.
I just wanted to fit a linear regression with sklearn which I use as benchmark for other non-linear approaches, such as MLPRegressor, but also variations of linear regression, such as Ridge, Lasso and ElasticNet (see here for an introduction to this group: http://scikit-learn.org/stable/modules/linear_model.html).
Doing it the same ways as described by #silviomoreto (which worked for all other models) actually for me resulted in an errogenous model (very high errors). This is most likely due to the so called dummy variable trap, which occurs due to multicollinearity in the variables when you include one dummy variable per category for categoric variables -- which is exactly what OneHotEncoder does! See also the following discussion on statsexchange: https://stats.stackexchange.com/questions/224051/one-hot-vs-dummy-encoding-in-scikit-learn.
To avoid this, I wrote a simple wrapper that excludes one variable, which then acts as the default.
class DummyEncoder(BaseEstimator, TransformerMixin):
def __init__(self, n_values='auto'):
self.n_values = n_values
def transform(self, X):
ohe = OneHotEncoder(sparse=False, n_values=self.n_values)
return ohe.fit_transform(X)[:,:-1]
def fit(self, X, y=None, **fit_params):
return self
So building on the code of #silviomoreto, you would change line 6:
enc = DummyEncoder()
This solved the problem for me. Note that OneHotEncoder worked fine (and better) for all other models, such as Ridge, Lasso and ANN.
I chose this way, because I wanted to include it in my feature pipeline. But you seem to have the data already encoded. Here, you would have to drop one column per category (e.g. for male/female only include one). So if you for example used pandas.get_dummies(...), this can be done with the parameter drop_first=True.
Last but not least, if you really need to go deeper into linear regression in Python, and not use it just as a benchmark, I would recommend statsmodels over scikit-learn (https://pypi.python.org/pypi/statsmodels), as it provides better model statistics, e.g. p-values per variable, etc.
how to prepare data for sklearn LinearRegression
OneHotEncode should only be used on the intended columns: those with categorical variables or strings, or integers that are essentially levels rather than numeric.
DO NOT apply OneHotEncode to your entire dataset including numerical variable or Booleans.
To prepare the data for sklearn LinearRegression, the numerical and categorical should be separately handled.
numerical columns: standardize if your model contains interactions or polynomial terms
categorical columns: apply OneHot either through sklearn or pd.get_dummies. pd.get_dummies is more flexible while OneHotEncode is more consistent in working with sklearn API.
drop='first'
As of version 0.22, OneHotEncoder in sklearn has drop option. For example OneHotEncoder(drop='first').fit(X), which is similar to
pd.get_dummies(drop_first=True).
use regularized linear regression
If you use regularized linear regression such as Lasso, multicollinear variables will be penalized and shrunk.
limitation of p-value statistics
The p-value in OLS is only valid when the OLS assumptions are more or less true. While there are methods to deal with situations when p-values cannot be trusted, one potential solution is to use cross validation or leave-one-out for gaining confidence on the model.

Categories