Multiple Imputation within Python and Decision Trees - python

I am trying to do multiple imputation in Python.
My motivation comes from the mice package in R; I am looking for something equivalent in Python and found sklearn's IterativeImputer.
Following the documentation and some posts on SO, I am able to produce multiple imputed sets. However, the imputed values are drawn from a distribution by setting sample_posterior = True, and this is not what I am looking for. I would like the imputed values not to be drawn from a distribution but to be real observations, i.e. as in R, drawn from the values that fall in the same leaf of a decision tree (see page 94 of https://cran.r-project.org/web/packages/mice/mice.pdf). Is there a way to change the "prediction" of a decision tree within the IterativeImputer so that it draws a random observation from the same leaf?
Documentation: https://scikit-learn.org/stable/modules/impute.html
Post on SO:
IterativeImputer - sample_posterior and Imputing missing values using sklearn IterativeImputer class for MICE
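For context, this is roughly how I am producing the multiple imputed sets at the moment (a minimal sketch; the toy data and column names are just placeholders):

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data with missing values; placeholder column names.
df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0, 5.0],
                   "b": [2.0, np.nan, 6.0, 8.0, 10.0]})

# Each run with a different random_state and sample_posterior=True
# yields one imputed dataset, drawn from the estimated posterior.
imputed_sets = []
for i in range(5):
    imp = IterativeImputer(sample_posterior=True, random_state=i)
    imputed_sets.append(pd.DataFrame(imp.fit_transform(df), columns=df.columns))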

miceforest does what you are looking for. It implements mean matching by default, which will pull from real samples in the data.
However, miceforest uses lightgbm as a backend. This may or may not be what you want.
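A rough sketch of the miceforest workflow, assuming a recent version of the package (argument names such as the number of datasets have shifted between releases, so treat this as an outline to check against the current docs):

import numpy as np
import pandas as pd
import miceforest as mf

# Toy data with missing values; substitute your own dataframe.
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
df.loc[rng.rand(200) < 0.1, "a"] = np.nan   # introduce ~10% missing values

kernel = mf.ImputationKernel(df, random_state=1)  # build the MICE kernel on the data
kernel.mice(5)                                    # run 5 MICE iterations
completed = kernel.complete_data()                # retrieve an imputed dataset

Because miceforest applies (predictive) mean matching by default, the imputed values are pulled from observations actually present in the data rather than sampled from a parametric posterior.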

Related

Is there a way to constrain a particular data point to a cluster with SciKit K Means or other clustering method?

I am using the SciKit K-Means cluster_centers_ attribute to give me a "centre of gravity" for weighted data points. I would like to be able to restrict a subset of data points to a particular cluster and have that affect the centre location. Is there a way to do this by preventing certain values from being attributed to the nearest cluster by default?
You should probably try DBSCAN, and see if it does what you want. Here is a link to several clustering algos.
https://machinelearningmastery.com/clustering-algorithms-with-python/
Just choose the model you want, fit it, and interpret the output.
Similar to the link above, you can test several clustering models using the sample code from the link below.
https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20Algorithms%20Compared.ipynb
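To make the suggestion concrete, here is a minimal sketch of fitting DBSCAN with scikit-learn on placeholder data (eps and min_samples are guesses you would tune for your data; this does not by itself constrain specific points to a cluster):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Placeholder 2-D points; substitute your weighted data points.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

db = DBSCAN(eps=0.5, min_samples=5).fit(X)  # eps / min_samples need tuning per dataset
labels = db.labels_                          # -1 marks points treated as noise
print(np.unique(labels))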

Is there a way in python to identify the exact values from a dataframe with multiple independent variables that generate these outliers?

I fitted a model with the xgboost algorithm in Python and graphed the predicted values vs the actual ones on a scatterplot (see image).
As you can see, there are several outliers (I drew a circle around them) which greatly damage the model, and I would like to get rid of them.
Is there a way in Python to identify the exact values from a dataframe with multiple independent variables that generate these outliers? [image: predicted vs actual values]
There is something called anomaly/outlier detection; you should check that out.
Here is a link
There are several algorithms available in Python for multivariate anomaly detection in sklearn, such as DBSCAN, Isolation Forest, and One-Class SVM, and Isolation Forest is generally deemed to perform well when the dataset has many attributes. However, before using anomaly/outlier detection algorithms, one needs to determine whether these values are actually outliers or whether they are natural behaviour for the dataset. If they are natural behaviour, then rather than removing the records one might have to normalize/bin the data, apply other feature-engineering techniques, or look at more complex algorithms to fit the data. What if the relation between the target variable and the independent variables is non-linear?
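A minimal sketch of the Isolation Forest route with scikit-learn, on placeholder data (the dataframe and the contamination value are assumptions you would replace and tune):

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Placeholder dataframe of independent variables; substitute your own.
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=["x1", "x2", "x3", "x4"])

iso = IsolationForest(contamination=0.02, random_state=0)  # contamination is a tuning guess
flags = iso.fit_predict(X)            # -1 = flagged as outlier, 1 = inlier
outlier_rows = X[flags == -1]         # exact rows driving the outliers
print(outlier_rows)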

How to evaluate HDBSCAN text clusters?

I'm currently trying to use HDBSCAN to cluster movie data. The goal is to cluster similar movies together (based on movie info like keywords, genres, actor names, etc.) and then apply LDA to each cluster to get the representative topics. However, I'm having a hard time evaluating the results (apart from visual analysis, which doesn't scale well as the data grows). With LDA, although it's hard to evaluate, I've been using the coherence measure. However, does anyone have any idea how to evaluate the clusters made by HDBSCAN? I haven't been able to find much info on it, so if anyone has any idea, I'd very much appreciate it!
HDBSCAN implements Density-Based Clustering Validation (DBCV), exposed as relative_validity on the fitted clusterer. It allows you to compare one clustering, obtained with a given set of hyperparameters, to another one.
In general, read about cluster analysis and cluster validation.
Here's a good discussion about this with the author of the HDBSCAN library.
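A hedged sketch of comparing two hyperparameter settings this way, assuming the hdbscan package (in the versions I am aware of the score is the relative_validity_ attribute and is only populated when the clusterer is built with gen_min_span_tree=True; the data here is a placeholder):

import hdbscan
from sklearn.datasets import make_blobs

# Placeholder data; substitute your movie feature vectors.
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=1.2, random_state=0)

scores = {}
for min_cluster_size in (5, 15, 30):
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                                gen_min_span_tree=True).fit(X)
    scores[min_cluster_size] = clusterer.relative_validity_  # DBCV-style score; higher is better

print(scores)  # pick the setting with the highest relative validity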
It's the same problem everywhere in unsupervised learning.
It is unsupervised; you are trying to discover something new and interesting. There is no way for the computer to decide whether something is actually interesting or new. It can only decide trivial cases, where the prior knowledge is already coded in machine-processable form, and you can compute some heuristic values as a proxy for interestingness. But such measures (including density-based measures such as DBCV) are in no way better at judging this than the clustering algorithm itself is at choosing the "best" solution.
But in the end, there is no way around manually looking at the data and taking the next step: trying to put what you learned from the data to use. Presumably you are not an ivory-tower academic doing this just to make up yet another useless method... So use it, don't fake using it.
You can try the clusteval library. This library helps you find the optimal number of clusters in your dataset, and it also supports hdbscan. Once you have the cluster labels, you can start enrichment analysis using hnet.
pip install clusteval
pip install hnet
Example:
# Import library
from clusteval import clusteval
# Set the method
ce = clusteval(method='hdbscan')
# Evaluate
results = ce.fit(X)
# Make plot of the evaluation
ce.plot()
# Make scatter plot using the first two coordinates.
ce.scatter(X)
So at this point you have the optimal detected cluster labels, and now you may want to know whether there is an association between any of the clusters and a (group of) feature(s) in your meta-data. The idea is to compute, for each cluster label, how often it is seen for a particular class in your meta-data. This can be quantified with a P-value. The lower the P-value (below alpha=0.05), the less likely it happened by random chance.
results is a dict and contains the optimal cluster labels in the key labx. With hnet we can compute the enrichment very easily. More information can be found here: https://erdogant.github.io/hnet
# Import library
import hnet
# Get labels
clusterlabels = results['labx']
# Compute the enrichment of the cluster labels with the dataframe df
enrich_results = hnet.enrichment(df, clusterlabels)
When we look at enrich_results, there is a column category_label. These are the metadata variables of the dataframe df that we gave as input. The second column, P, stands for the P-value, which is the computed significance of the category_label with respect to the target variable y. In this case, the target variable y is the cluster labels clusterlabels.
The target labels in y can be significantly enriched more than once. This means that certain y are enriched for multiple variables in the dataframe. This can occur because we may need to better estimate the cluster labels, or because it is a mixed group, or something else.
More information about cluster enrichment can be found here:
https://erdogant.github.io/hnet/pages/html/Use%20Cases.html#cluster-enrichment

Adjusted Boxplot in Python

For my thesis, I am trying to identify outliers in my data set. The data set consists of 160,000 measurements of one variable from a real process environment. In this environment, however, there can be measurements that are not actual data from the process itself but simply junk data. I would like to filter them out with a little help from the literature instead of only "expert opinion".
Now I've read about the IQR method for seeing where possible outliers lie when dealing with a symmetric distribution like the normal distribution. However, my data set is right-skewed, and by distribution fitting, inverse gamma and lognormal were the best fits.
So, during my search for methods for non-symmetric distributions, I found this topic on crossvalidated where user603's answer is interesting in particular: Is there a boxplot variant for Poisson distributed data?
In user603's answer, he states that an adjusted boxplot helps to identify possible outliers in your dataset, and that R and Matlab have functions for this (there is an R implementation, robustbase::adjbox(), as well as a Matlab one in a library called libra).
I was wondering if there is such a function in Python. Or is there a way to calculate the medcouple (see paper in user603's answer) with python?
I really would like to see what comes out of the adjusted boxplot for my data.
In the module statsmodels.stats.stattools there is a function medcouple(), which is the measure of the skewness used in the Adjusted Boxplot.
With this value you can calculate the interval beyond which outliers are defined.
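A minimal sketch of the adjusted-boxplot fences of Hubert & Vandervieren, built around statsmodels' medcouple() (the exponent constants are the ones from that paper; the data below is a placeholder for your measurements):

import numpy as np
from statsmodels.stats.stattools import medcouple

# Placeholder right-skewed data; substitute your 160,000 measurements.
# Note: statsmodels' medcouple builds an n-by-n matrix internally,
# so for very large data you may need to work on a random subsample.
rng = np.random.RandomState(0)
data = rng.lognormal(mean=0.0, sigma=0.75, size=2000)

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
mc = medcouple(data)  # robust skewness measure in [-1, 1]

# Skewness-adjusted fences (Hubert & Vandervieren, 2008).
if mc >= 0:
    lower = q1 - 1.5 * np.exp(-4 * mc) * iqr
    upper = q3 + 1.5 * np.exp(3 * mc) * iqr
else:
    lower = q1 - 1.5 * np.exp(-3 * mc) * iqr
    upper = q3 + 1.5 * np.exp(4 * mc) * iqr

outliers = data[(data < lower) | (data > upper)]
print(len(outliers), "points flagged")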

Comparing datasets to nonstandard probability distributions in Python

I have a few large sets of data which I have used to create non-standard probability distributions (using numpy.histogram to bin the data, and scipy.interpolate's interp1d function to interpolate the resulting curves). I have also created a function which can sample from these custom PDFs using the scipy.stats package.
My goal is to see how varying the size of my samples changes the goodness of fit to both the distributions they came from, and the other PDFs as well, and determine how large a sample is necessary to completely determine whether it came from one or other of my custom PDFs.
To do this I've gathered that I need to use some sort of nonparametric statistical analysis, i.e. seeing whether a set of data has been drawn from a provided probability distribution. Doing a bit of research, it seems like the Anderson-Darling test is ideal for this, however its implementation in python (scipy.stats.anderson) seems to only be usable for preset probability distributions such as normal, exponential, etc.
So my question is: given my many nonstandard PDFs (or CDFs if necessary, or the data I used to create them) what is the best way to work out how well a set of sample data fits each model in Python? If it is the Anderson-Darling test, is there some way of defining a custom PDF to test against?
Thanks. Any help is much appreciated.
(1) "Is it from distribution X" is generally a question which can be answered a priori, if at all; a statistical test for it will only tell you "I have a large sample / not a large sample", which may be true but not too useful. If you are trying to classify new data into one distribution or another, my advice is to look at it as a classification problem and use your constructed pdf's to compute p(class | data) = p(data | class) p(class) / p(data) where the key part p(data | class) is your histogram. Maybe you can say more about your problem domain.
(2) You could apply the Kolmogorov-Smirnov test, but it's really pointless, as mentioned above.
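For completeness, a hedged sketch of running the Kolmogorov-Smirnov test against a custom distribution: scipy.stats.kstest accepts a callable CDF, so an interpolated CDF built from your data (names and construction below are illustrative) can be passed directly:

import numpy as np
from scipy import stats
from scipy.interpolate import interp1d

rng = np.random.RandomState(0)
reference = rng.gamma(shape=2.0, scale=1.0, size=100_000)  # placeholder "large set"

# Empirical CDF of the reference data, interpolated as in the question.
xs = np.sort(reference)
ecdf = np.arange(1, len(xs) + 1) / len(xs)
custom_cdf = interp1d(xs, ecdf, bounds_error=False, fill_value=(0.0, 1.0))

sample = rng.gamma(shape=2.0, scale=1.0, size=300)  # sample whose fit we want to test
statistic, p_value = stats.kstest(sample, custom_cdf)
print(statistic, p_value)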
