IterativeImputer - sample_posterior - python

Sklearn implements an imputer called IterativeImputer. I believe it works by predicting the values of missing features in a round-robin fashion, using an estimator.
It has an argument called sample_posterior but I can't seem to figure out when I should use it.
sample_posterior : boolean, default=False
Whether to sample from the (Gaussian) predictive posterior of the fitted estimator for each imputation. Estimator must support return_std in its predict method if set to True. Set to True if using IterativeImputer for multiple imputations.
I looked at the source code but it still wasn't clear to me. Should I use this if I have multiple features that I am going to fill with the iterative imputer, or should I use it if I plan to apply the imputer multiple times, e.g. to a training and then a validation set?

Even with multiple features, and a training and validation/test set, you don't need sample_posterior. The "multiple imputations" part of the docstring means generating more than one imputed (missing-values-filled) dataset; see e.g. the Wikipedia article on multiple imputation.
Normally, IterativeImputer imputes the missing values of a feature using the predictions of a model built on the other features (iteratively, round robin, etc.). If you use a model that produces not just a single prediction but an output distribution (the posterior), then you can sample from that distribution randomly, hence sample_posterior. By running it multiple times, with different random seeds, these random choices are different, and you get multiple imputed datasets.
The documentation on that isn't great, but there's a (somewhat aged) PR for an extended example, and a toy example on SO.
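For illustration, here is a minimal sketch of what that looks like in code (the data is synthetic; BayesianRidge, the default estimator, supports return_std):

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    rng = np.random.RandomState(0)
    X = rng.normal(size=(100, 3))
    X[rng.rand(100) < 0.2, 0] = np.nan  # knock out ~20% of the first column

    # Each differently seeded imputer draws different samples from the
    # posterior, producing a different completed dataset.
    imputed = [
        IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
        for seed in range(5)
    ]

    # With sample_posterior=False (the default), all five results would be identical.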

Related

Why does an xgboost BDT model constructed with the histogram tree method depend on the training data ordering?

I was using Python xgboost to train some models (with a binary logistic objective) on some data (50k events in total), using the histogram tree method (tree_method="hist"). I shuffled the events in the data before training. It turned out that the models built are slightly different depending on the order of the events, and some results based on the corresponding predictions on a validation set (disjoint from the training set) could vary by up to 5%. As a double check I also tried lightgbm, and this effect is present there as well. It seems to be a problem of the histogram method, because if I use the exact method in xgboost (tree_method="exact") the problem disappears.
Does anyone know why a BDT model based on the histogram method depends on the event order?
I tried to look for the reference paper but was totally lost.
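For reference, a minimal sketch to reproduce the effect (the dataset here is synthetic, so the exact numbers will differ from the 50k-event case described above):

    import numpy as np
    import xgboost as xgb
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)
    perm = np.random.RandomState(1).permutation(len(X))

    preds = []
    for Xi, yi in [(X, y), (X[perm], y[perm])]:
        model = xgb.XGBClassifier(
            tree_method="hist",  # switch to "exact" and the gap should disappear
            objective="binary:logistic",
            n_estimators=100,
            random_state=0,
        )
        model.fit(Xi, yi)
        preds.append(model.predict_proba(X)[:, 1])

    # Typically nonzero for "hist": the model depends on the row order.
    print(np.abs(preds[0] - preds[1]).max())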

Why do I need to call fit() before transform() when using PolynomialFeatures?

Hello to all you great minds,
I'm trying to understand more rigorously the way polynomial fitting works with scikit. More specifically, what I'm trying to do is break down the process and show only a dataframe with the new polynomial features generated based on a single value.
So I have data with several entries, each 1-dimensional. I want to generate a design matrix suitable for polynomial fitting. What I am currently doing is along these lines:
pd.DataFrame(PolynomialFeatures(k).fit_transform(X))
And this works as expected.
However, what I'm struggling with is the role of fit_transform(). I am not trying to fit anything quite yet, merely to produce a dataframe with the newly constructed polynomial features. Naively I tried changing fit_transform() to transform(), but apparently I have to use fit before I am allowed to transform.
I would appreciate it if anyone could point me to my error. I am not yet trying to fit a model to the data, only to create a design matrix with the polynomial features, so why do I have to use fit() (or fit_transform(), for that matter)? In fact, I don't really understand what fit() actually does here, and the documentation didn't help me wrap my head around it.
Thank you!
I think the reason for this is to be consistent with their API. When doing preprocessing you still want to "fit" to some train data and apply the same preprocessing step to the train AND the test data.
An example where it becomes clearer is standard scaling (StandardScaler, a different preprocessing step). You calculate the mean and std from the train data and apply the same scaling, (X - mean) / std, to the train AND the test data (with the mean and std taken from the train data).
Therefore the two methods fit and transform are separated.
In your case of polynomial features, "fitting" extracts essentially no information from the train data (only the number of input features and the resulting exponents), and the step could be applied directly to the test data without knowing the train data. But including fit in PolynomialFeatures makes it consistent with the whole API, and that consistency becomes necessary when you pipeline multiple preprocessing steps.
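To make that concrete, here is a small sketch (toy data) showing why fit and transform are separated, and what PolynomialFeatures actually learns during fit:

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures, StandardScaler

    X_train = np.array([[1.0], [2.0], [3.0]])
    X_test = np.array([[10.0]])

    scaler = StandardScaler()
    scaler.fit(X_train)        # learns mean_ and scale_ from the training data
    scaler.transform(X_train)  # scales with the training mean/std
    scaler.transform(X_test)   # the SAME training mean/std is applied here

    poly = PolynomialFeatures(degree=2)
    poly.fit(X_train)          # only records the input width and the exponents
    print(poly.powers_)        # [[0], [1], [2]] -> 1, x, x^2
    poly.transform(X_test)     # now allowed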

fit() vs fit_predict() methods in sklearn KMeans

There are two methods available when we build a model with sklearn.cluster.KMeans. The first is fit() and the other is fit_predict(). My understanding is that when we use the fit() method on a KMeans model, it gives us an attribute labels_, which basically holds the info on which observation belongs to which cluster. A model fitted with fit_predict() also has this labels_ attribute.
So my question are,
If fit() fulfills the need, then why is there fit_predict()?
Are fit() and fit_predict() interchangeable while writing code?
KMeans is just one of the many models that sklearn has, and many share the same API. The basic functions are fit, which teaches the model using examples, and predict, which uses the knowledge obtained by fit to answer questions on potentially new values.
KMeans will automatically predict the cluster of all the input data during training, because doing so is integral to the algorithm, and it keeps the result around (as labels_) because predicting the labels for the original dataset is very common. Thus fit_predict adds very little: it is a convenience method that calls fit, then returns the labels of the training dataset. (fit_predict doesn't have a labels_ attribute itself; it just returns the labels.)
However, if you want to train your model on one set of data and then use this to quickly (and without changing the established cluster boundaries) get an answer for a data point that was not in the original data, you would need to use predict, not fit_predict.
In other models (for example sklearn.neural_network.MLPClassifier), training can be a very expensive operation, so you may not want to re-train a model every time you want to predict something; also, it is not a given that the prediction result for the training data is generated as a part of the fitting, as it is for KMeans. Or, as discussed above, you may simply not want to change the model in response to new data. In those cases, you cannot get predictions from the result of fit: you need to call predict with the data you want a prediction for.
Also note that labels_ ends with an underscore, the scikit-learn convention for attributes that only exist after fitting (not to be confused with Python's leading-underscore convention for private names). Whenever possible, you should use the established API (predict, fit_predict) rather than reaching into such attributes.
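A small illustration of the difference:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.RandomState(0).normal(size=(200, 2))

    km = KMeans(n_clusters=3, n_init=10, random_state=0)
    labels = km.fit_predict(X)  # train and return the training labels

    # Identical to fitting and reading labels_:
    assert (labels == km.labels_).all()

    # For new points, use predict -- the centroids are not refitted:
    new_points = np.array([[0.0, 0.0], [5.0, 5.0]])
    print(km.predict(new_points))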
In scikit-learn there are similar pairs, such as fit and fit_transform. Fit plus either predict or labels_ is all that is essential for clustering. Thus fit_predict is just more efficient code, and its result is the same as the result of fit followed by predict (or reading labels_). In addition, the fitted clustering model is used only once, when determining the cluster labels of the samples.

model evaluation with "train_test_split" not static?

According to resources online, the train_test_split function from the sklearn.cross_validation module (sklearn.model_selection in current versions) returns data in a random state.
Does this mean that if I train a model with the same data twice, I am getting two different models, since the training data points used in the learning process are different in each case?
In practice can the accuracy of such two models differ a lot? Is that a possible scenario?
You can set the random_state parameter to some constant value to reproduce data splits. On the other hand, it's generally a good idea to test exactly what you are trying to find out - i.e. run your training at least twice with different random states and compare the results. If they differ a lot, it's a sign that something is wrong and your solution is not reliable.
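A quick way to see this for yourself (the dataset and classifier here are just arbitrary examples):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Different splits -> (possibly) different scores.
    for seed in [0, 1, 2]:
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
        model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
        print(seed, model.score(X_te, y_te))

    # Fixing random_state in train_test_split makes the split, and hence
    # the model, reproducible.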

Classification results depend on random_state?

I want to implement an AdaBoost model using scikit-learn (sklearn). My question is similar to another question but it is not totally the same. As far as I understand, the random_state variable described in the documentation is for randomly splitting the training and testing sets, according to the previous link. So if I understand correctly, my classification results should not depend on the seeds; is that correct? Should I be worried if my classification results turn out to depend on the random_state variable?
Your classification scores will depend on random_state. As @Ujjwal rightly said, it is used for splitting the data into training and test sets. Not just that: a lot of algorithms in scikit-learn use the random_state to select subsets of features or samples, to determine initial weights, etc.
For example:
Tree-based estimators use random_state for random selections of features and samples (e.g. DecisionTreeClassifier, RandomForestClassifier).
In clustering estimators like KMeans, random_state is used to initialize the cluster centers.
SVMs use it when estimating probabilities.
Some feature selection algorithms also use it for their initial selection.
And many more...
It's mentioned in the documentation that:
If your code relies on a random number generator, it should never use functions like numpy.random.random or numpy.random.normal. This approach can lead to repeatability issues in tests. Instead, a numpy.random.RandomState object should be used, which is built from a random_state argument passed to the class or function.
Do read the following questions and answers for better understanding:
Choosing random_state for sklearn algorithms
confused about random_state in decision tree of scikit learn
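For instance, here is a small sketch showing that the estimator itself, not just the split, depends on random_state (synthetic data, arbitrary settings):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Same data, same split -- only the estimator's internal randomness changes:
    for seed in [0, 1, 2]:
        clf = RandomForestClassifier(n_estimators=50, random_state=seed)
        print(seed, clf.fit(X_tr, y_tr).score(X_te, y_te))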
It does matter. When your training set differs, your trained state also changes: with a different subset of the data you can end up with a classifier that is a little different from one trained on some other subset.
Hence, you should use a constant seed like 0 or another integer, so that your results are reproducible.
