Measuring "mixtureness" of labeled data (python) - python

I have some 2D data:
The data is labeled and shown in different colors. Definitely a non supervised process will not yield any correct prediction because the data is pretty mixed (although the colors seem to have regions of preference). I want to see if it is possible to measure how mixed are points from different sets.
For this I need to define a measurement of how mixed they are (I think that this should exist). Also it would be nice to have these algorithms implemented. I am also looking for a simple predictive model that can be trained used the data shown. Thanks for your help. If possible I'm looking for these implementations in python.

Related

Is there a way in python to identify the exact values from a dataframe with multiple independent variables that generate these outliers?

I predicted a model with the xgboost algorithm on python and I graphed the predicted values vs the actal ones on a scatterplot (see image).
As you can see, there are several outliers (I drew a circle around them) who greatly damage the model and I would like to get rid of them.
Is there a way in python to identify the exact values from a dataframe with multiple independent variables that generate these outliers?[predicted vs actual values]
There is something called an anomaly/outlier detection system, should check that out.
Here is a link
There are several algorithms which are available in python for multivariate anomaly detection in sklearn like DBSCAN , Isolation Forest , One Class SVM etc and generally isolation forest is deemed have good anomaly/outlier detection when the dataset has high attributes. However more before using anomaly/outlier detection algorithms one needs to identify if these values are actually outlier or whether they are natural behaviour for the dataset. If yes then rather than removing the records one might have to normalize /bin or apply other feature engineering technique/ look at more complex algorithms to fit the data. What if the relation of the target variable and the independent variables is non-linear?

Machine Learning dataset with many discrete features

I am working with a medical data set that contains many variables with discrete outputs. For example: type of anesthesia, infection site, Diabetes y/n. And to deal with this I have just been converting them into multiple columns with ones and zeros and then removing one to make sure there is not a direct correlation between them but I was wondering if there was a more efficient way of doing this
It depends on the purpose of the transformation. Converting categories to numerical labels may not make sense if the ordinal representation does not correspond to the logic of the categories. In this case, the "one-hot" encoding approach you have adopted is the best way to go, if (as I surmise from your post) the intention is to use the generated variables as the input to some sort of regression model. You can achieve what you are looking to do using pandas.get_dummies.

Predicting nucleotide sequence efficiency

I am new to machine learning and I am wondering whether it would be possible to use my available biological data for clustering. I want to find out whether a group of DNA sequences can be clustered into two groups, efficient and not efficient.
I have five sets, each containing about 480 short sequences (lets call them samples). Each set is having an effect with different strength:
Set1 - Very good effect
Set2 - Good effect
Set3 - Minor effect
Set4 - Very minor effect
Set5 - No effect
Each sample has some features, e.g. free energy,starting with a specific nucleotide...
Now my question is whether I can find out which type of sample in my sets are playing a role for the effect of the whole set. My only assumption is that in set1 I have more efficient samples then in set5 (either none or very few). A very simple (not realistic) result could be, all samples which start with nucleotide 'A' end end with nucleotide 'C' are causing the effect.
Is it possible to use machine learning to find out?
Thanks!
That definitely sounds like a problem where machine learning could give good results. I recommend that you look into scikit-learn, a powerful and easy to use toolkit for machine learning in Python. There are many introductory examples and tutorials available.
For your use case, I would say that random forests could give good results, although it's hard to say without knowing more about the structure of the data. They are available in the class RandomForestClassifier in sklearn. Again, there are many tutorials and examples to be found.
Since your training data is unlabeled, you may want to look into unsupervised learning methods. A simple class of such methods are clustering algorithms. In sklearn, you can find, for instance, k-means clustering along other such algorithms. The idea would be to let the algorithm split your data into different cluster and see if there is any correlation between cluster membership and observed effect.
It is unclear from your description what the 5 sets (what sound like labels) correspond to, but I will assume that you are essentially asking about feature learning: you would like to know which features to choose to best predict what set a given sequence is from. Determining this from scratch is an open problem in machine learning and there are many possible approaches depending on the particulars of your situation.
You can select a set of features (just by making logical guesses) and calculate them for all sequences, then perform PCA on all the vectors you have generated. PCA will give you the linear combination of features that accounts for the most variability in your data which is useful in designing meaningful features.

Supervised Machine Learning: Classify types of clusters of data based on shape and density (Python)

I have multiple sets of data, and in each set of data there is a region that is somewhat banana shaped and two regions that are dense blobs. I have been able to differentiate these regions from the rest of the data using a DBSCAN algorithm, but I'd like to use a supervised algorithm to have the program then know which cluster is the banana, and which two clusters are the dense blobs, and I'm not sure where to start.
As there are 3 categories (banana, blob, neither), would doing two separate logistic regressions be the best approach (evaluate if it is banana or not-banana and if it is blob or not-blob)? or is there a good way to incorporate all 3 categories into one neural network?
Here are three data sets. In each, the banana is red. In the 1st, the two blobs are green and blue, in the 2nd the blobs are cyan and green, and in the the 3rd the blobs are blue and green. I'd like the program to (now that is has differentiated the different regions, to then label the banana and blob regions so I don't have to hand pick them every time I run the code.
As you are using python, one of the best options would be to start with some big library, offering many different approaches so you can choose which one suits you the best. One of such libraries is sklearn http://scikit-learn.org/stable/ .
Getting back to the problem itself. What are the models you should try?
Support Vector Machines - this model has been around for a while, and became a gold standard in many fields, mostly due to its elegant mathematical interpretation and ease of use (it has much less parameters to worry about then classical neural networks for instance). It is a binary classification model, but library automaticaly will create a multi-classifier version for you
Decision tree - very easy to understand, yet creates quite "rough" decision boundaries
Random forest - model often used in the more statistical community,
K-nearest neighours - most simple approach, but if you can so easily define shapes of your data, it will provide very good results, while remaining very easy to understand
Of course there are many others, but I would recommend to start with these ones. All of them support multi-class classification, so you do not need to worry how to encode the problem with three classes, simply create data in the form of two matrices x and y where x are input values and y is a vector of corresponding classes (eg. numbers from 1 to 3).
Visualization of different classifiers from the library:
So it remains a question how to represent shape of a cluster - we need a fixed length real valued vector, so what can features actually represent?
center of mass (if position matters)
skewness/kurtosis
covariance matrix (or its eigenvalues) (if rotation matters)
some kind of local density estimation
histograms of some statistics (like histogram of pairwise Euclidean distances between
pairs of points on the shape)
many, many more!
There is quite comprehensive list and detailed overview here (for three-dimensional objects):
http://web.ist.utl.pt/alfredo.ferreira/publications/DecorAR-Surveyon3DShapedescriptors.pdf
There is also quite informative presentation:
http://www.global-edge.titech.ac.jp/faculty/hamid/courses/shapeAnalysis/files/3.A.ShapeRepresentation.pdf
Describing some descriptors and how to make them scale/position/rotation invariant (if it is relevant here)
Could Neural networks help , the "pybrain" library might be the best for it.
You could set up the neural net as a feed forward network. set it so that there is an output for each class of object you expect the data to contain.
Edit :sorry if I have completely misinterpreted the question. I'm assuming you have preexisting data you can feed to train the networks to differentiate clusters.
If there are 3 categories you could have 3 outputs to the NN or perhaps a single NN for each one that simply outputs a true or false value.
I believe you are still unclear about what you want to achieve.
That of course makes it hard to give you a good answer.
Your data seems to be 3D. In 3D you could for example compute the alpha shape of a cluster, and check if it is convex. Because your "banana" probably is not convex, while your blobs are.
You could also measure e.g. whether the cluster center actually is inside your cluster. If it isn't, the cluster is not a blob. You can measure if the extends along the three axes are the same or not.
But in the end, you need some notion of "banana".

Classifying a Distribution of Points for Object Identification

I have some points that I need to classify. Given the collection of these points, I need to say which other (known) distribution they match best. For example, given the points in the top left distribution, my algorithm would have to say whether they are a better match to the 2nd, 3rd, or 4th distribution. (Here the bottom-left would be correct due to the similar orientations)
I have some background in Machine Learning, but I am no expert. I was thinking of using Gaussian Mixture Models, or perhaps Hidden Markov Models (as I have previously classified signatures with these- similar problem).
I would appreciate any help as to which approach to use for this problem. As background information, I am working with OpenCV and Python, so I would most likely not have to implement the chosen algorithm from scratch, I just want a pointer to know which algorithms would be applicable to this problem.
Disclaimer: I originally wanted to post this on the Mathematics section of StackExchange, but I lacked the necessary reputation to post images. I felt that my point could not be made clear without showing some images, so I posted it here instead. I believe that it is still relevant to Computer Vision and Machine Learning, as it will eventually be used for object identification.
EDIT:
I read and considered some of the answers given below, and would now like to add some new information. My main reason for not wanting to model these distributions as a single Gaussian is that eventually I will also have to be able to discriminate between distributions. That is, there might be two different and separate distributions representing two different objects, and then my algorithm should be aware that only one of the two distributions represents the object that we are interested in.
I think this depends on where exactly the data comes from and what sort of assumptions you would like to make as to its distribution. The points above can easily be drawn even from a single Gaussian distribution, in which case the estimation of parameters for each one and then the selection of the closest match are pretty simple.
Alternatively you could go for the discriminative option, i.e. calculate whatever statistics you think may be helpful in determining the class a set of points belongs to and perform classification using SVM or something similar. This can be viewed as embedding these samples (sets of 2d points) in a higher-dimensional space to get a single vector.
Also, if the data is actually as simple as in this example, you could just do the principle component analysis and match by the first eigenvector.
You should just fit the distributions to the data, determine the chi^2 deviation for each one, look at F-Test. See for instance these notes on model fitting etc
You might want to consider also non-parametric techniques (e.g. multivariate kernel density estimation on each of your new data set) in order to compare the statistics or distances of the estimated distributions. In Python stats.kde is an implementation in SciPy.Stats.

Categories