How to find the success rate of a clustering algorithm?

I have implemented several clustering algorithms on an image dataset.
I'm interested in deriving the success rate of the clustering. I have to detect the tumor area. I know where the tumor is located in the original image, so I would like to compare the two images and obtain a percentage of success.
Following images:
Original image: I know the position of cancer
Image after clustering algorithm
I'm using python 2.7.

Segmentation Accuracy
This is a pretty common problem addressed in image segmentation literature, e.g., here is a StackOverflow post
One common approach is to consider the ratio of "correct pixels" to "incorrect pixels," which is common in image segmentation for safety-critical domains, e.g., Mask RCNN, PixelNet.
Treating it as more of an object detection task, you could take the overlap of the hull of the objects and just measure accuracy (commonly broken down into precision, recall, f-score, and other measures with various bias/skews). This allows you to produce an ROC curve that can be calibrated for false positives/false negatives.
There is no domain-agnostic consensus on what's correct. KITTI provides both.
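As a rough illustration of the pixel-level measures mentioned above, here is a minimal sketch using numpy. It assumes both images have already been reduced to binary masks of the same shape (True where tumor); the function name segmentation_scores and the toy 4x4 masks are made up for the example and are not taken from Mask RCNN or KITTI.

```python
import numpy as np


def segmentation_scores(pred_mask, true_mask):
    """Compare two binary masks of the same shape, pixel by pixel."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    if pred.shape != true.shape:
        raise ValueError("masks must have the same shape")

    tp = np.count_nonzero(pred & true)    # tumor pixels that were found
    fp = np.count_nonzero(pred & ~true)   # healthy pixels marked as tumor
    fn = np.count_nonzero(~pred & true)   # tumor pixels that were missed
    tn = np.count_nonzero(~pred & ~true)  # healthy pixels left alone

    def ratio(num, den):
        # Guard against empty masks (division by zero).
        return float(num) / den if den else 0.0

    return {
        "pixel_accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "iou": ratio(tp, tp + fp + fn),
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Toy example (hypothetical data): the true tumor is a 2x2 square and the
# predicted region is the same square shifted right by one column.
true_mask = np.zeros((4, 4), dtype=bool)
true_mask[1:3, 1:3] = True
pred_mask = np.zeros((4, 4), dtype=bool)
pred_mask[1:3, 2:4] = True

scores = segmentation_scores(pred_mask, true_mask)
```

Pixel accuracy can look flattering when the tumor is a small part of the image, because the background dominates the count; the overlap measures (IoU, Dice) ignore the true negatives and are usually more informative in that case.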
Mask RCNN is open source and state-of-the-art, and provides implementations in Python of:
Computing image matching between segmented and original
Displaying the differences
In your domain (medicine), standard statistical rules apply. Use a holdout set. Cross validate. Etc. (*)
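A minimal sketch of what "holdout set plus cross validation" could look like with scikit-learn, assuming you have one identifier per annotated image. The image_ids array, the 80/20 split, and the four folds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Hypothetical: one id per annotated image.
image_ids = np.arange(20)

# Set aside a final test set that is never used for tuning.
dev_ids, test_ids = train_test_split(image_ids, test_size=0.2, random_state=0)

# Cross-validate on the remaining images.
kfold = KFold(n_splits=4, shuffle=True, random_state=0)
folds = [
    (dev_ids[train_idx], dev_ids[val_idx])
    for train_idx, val_idx in kfold.split(dev_ids)
]
```

Splitting by image (or, in medicine, by patient) rather than by pixel keeps pixels from the same image out of both sides of a split.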
Note: although the literature space is dauntingly large, I'd advise you to take a look at some domain-relevant papers, as they may take fewer "statistical short cuts" than other vision (digit recognition e.g.) projects accept.
"Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool" provides some summary methods in your domain
"Current methods in image segmentation" has about 2500 citations but is a little older.
"Review of MR image segmentation techniques using pattern recognition" is a little older still and will get you safely into "traditional" vision models.
Automated Segmentation of MR Images of Brain Tumors is largely about its segmentation validation process
Python
Besides the mask rcnn links above, scikit-learn provides some extremely user friendly tools and is considered part of the standard science "stack" for python.
Implementing the difference between images in python is trivial (using numpy). Here's an overkill SO link.
Bounding box intersection in python is easy to implement on one's own; I'd use a library like shapely if you want to measure general polygon intersection.
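For axis-aligned boxes, a hand-rolled intersection-over-union might look like the sketch below. It assumes boxes are given as (x_min, y_min, x_max, y_max) tuples; that layout and the name box_iou are choices made for this example.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max) with x_max >= x_min
    and y_max >= y_min.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlap, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter

    return float(inter) / union if union > 0 else 0.0


overlap = box_iou((0, 0, 2, 2), (1, 1, 3, 3))  # two 2x2 boxes sharing a 1x1 corner
```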
Scikit-learn has some nice machine-learning evaluation tools, for example,
ROC curves
Cross validation
Model selection
A million others
Literature Searching
One reason that you may have trouble searching for the answer is because you're trying to measure performance of an unsupervised method, clustering, in a supervised learning arena. "Clusters" are fundamentally under-defined in mathematics (**). You want to be looking at the supervised learning literature for accuracy measures.
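One practical consequence: a clustering algorithm returns arbitrary cluster ids, so before any supervised accuracy measure can be applied you have to decide which cluster counts as "tumor". The sketch below shows one possible way to do that, assuming a ground-truth mask is available: score every cluster by its overlap (IoU) with the mask and keep the best one. The function name and the toy label image are hypothetical.

```python
import numpy as np


def best_cluster_for_mask(cluster_labels, true_mask):
    """Return (label, iou) of the cluster that best overlaps the true mask."""
    labels = np.asarray(cluster_labels)
    true = np.asarray(true_mask, dtype=bool)

    best_label, best_iou = None, -1.0
    for label in np.unique(labels):
        member = labels == label
        inter = np.count_nonzero(member & true)
        union = np.count_nonzero(member | true)
        iou = float(inter) / union if union else 0.0
        if iou > best_iou:
            best_label, best_iou = label, iou
    return best_label, best_iou


# Toy label image with three clusters (hypothetical data).
cluster_labels = np.array([
    [0, 0, 1, 1],
    [0, 2, 2, 1],
    [0, 2, 2, 1],
    [0, 0, 1, 1],
])
true_mask = np.zeros((4, 4), dtype=bool)
true_mask[1:3, 1:3] = True

label, iou = best_cluster_for_mask(cluster_labels, true_mask)
```

Note that picking the best cluster using the ground truth is itself a form of fitting, so the selection should be made on training images and then evaluated on held-out ones.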
There is literature on unsupervised learning/clustering, too, which looks for topological structure, generally. Here's a very introductory summary. I don't think that is what you want.
A common problem, especially at scale, is that supervised methods require labels, which can be time consuming to produce accurately for dense segmentation. Object detection makes it a little easier.
There are some existing datasets for medicine ([1], [2], e.g.) and some ongoing research in label-less metrics. If none of these are options for you, then you may have to revert to considering it an unsupervised problem, but evaluation becomes very different in scope and utility.
Footnotes
[*] Vision people sometimes skip cross validation even though they shouldn't, mainly because the models are slow to fit and they're a lazy bunch. Please don't skip a train/test/validation split, or your results may be dangerously useless
[**] You can find all sorts of "formal" definitions, but never two people to agree on which one is correct or most useful. Here's denser reading

Related

simple way to detect street area in google map images (aerial images)

im trying to detect the area of the street in image without any deep learning method.
say i have this image:
i am looking for any simple method to detect street portion of the image like the following:
now i know this might not be very accurate, and accuracy is not the problem at all , i am trying to achieve this without using any deep learning method.

A Hough line transform can give you a direct straight-line measure, but I don't think it will give you exactly what you want, as shown below.
You need a much more complicated algorithm, such as a deep semantic segmentation model, and you need to train it.
Even if you don't like deep learning, traditional algorithms such as variational analysis, SVM learning or AdaBoost are also very complicated, and you won't be able to use them easily. You need a much deeper understanding of those topics.
If you really want to, you can start with variational analysis (active contour model, snake energy) to extract the road first. Variational analysis is proven to work for complex scenes and to extract a particular model, as shown in the image below. Your road is the empty low-gradient region, and all the buildings and trees nearby are high-gradient responses that you don't want.
My suggestion is to make your life easier by using a pre-trained model and extracting the surface model. Download it and run the Python script. That's all.
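To make the "low gradient region" idea concrete, here is a very crude sketch that only thresholds the gradient magnitude with numpy. It is not an active contour or snake model, just a stand-in for the observation that road pixels tend to be smooth; the threshold of 10.0 and the synthetic test image are assumptions for illustration.

```python
import numpy as np


def low_gradient_mask(gray, threshold=10.0):
    """Mark pixels whose local intensity gradient is below the threshold."""
    gray = np.asarray(gray, dtype=float)
    grad_rows, grad_cols = np.gradient(gray)   # finite differences per axis
    magnitude = np.hypot(grad_rows, grad_cols)
    return magnitude < threshold


# Synthetic image (hypothetical data): random texture standing in for
# buildings and trees, with a flat horizontal band standing in for the road.
rng = np.random.RandomState(0)
image = rng.randint(0, 256, size=(12, 12)).astype(float)
image[4:8, :] = 100.0

road_mask = low_gradient_mask(image, threshold=10.0)
```

On real aerial images the mask would be noisy (flat roofs and water are smooth too), so some smoothing and morphological cleanup would likely be needed before it resembles a street.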
There are a few open-source implementations that you can try such as this
https://github.com/ArkaJU/U-Net-Satellite
https://github.com/Paulymorphous/Road-Segmentation
https://github.com/avanetten/cresi
Based on the predicted mask, you can then get an accurate result, as shown below.
This would be the result that you are looking for
Regards
Shenghai Yuan

How to understand/debug/visualize U-Net segmentation results

I am training a U-Net architecture for a segmentation task. This is in Python using Keras. I have now run into an issue that I am trying to understand:
I have two very similar images from a microscopy image series (these are consecutive images), where my current U-Net model performs very well on one, but performs extremely poorly on the immediately following one. However, there is little difference between the two to the eye, and the histograms also look very much alike. Also, on other measurements the model performs great across the whole frame range, but then this issue appears for other measurements.
I am using data-augmentation during training (histogram stretching, affine transformation, noise-addition) and I am surprised that still the model is so brittle.
Since the U-Net is still mostly a black-box to me, I want to find out steps I can take to better understand the issue and then adjust the training/model accordingly.
I know there are ways to visualize what individual layers learn (e.g. as discussed in F. Chollet's book, see here) and I should be able to apply these to U-Nets, which are fully convolutional.
However, these kinds of methods are practically always discussed in the realm of classifying networks - not semantic segmentation.
So my question is:
Is this the best/most direct approach to reach an understanding of how U-Net models attain a segmentation result? If not, what are better ways to understand/debug U-Nets?

I suggest you use the U-Net container on NGC https://ngc.nvidia.com/catalog/resources/nvidia:unet_industrial_for_tensorflow
I also suggest you read this: Mixed Precision Training: https://arxiv.org/abs/1710.03740
https://developer.nvidia.com/blog/mixed-precision-training-deep-neural-networks/
Let me know how you are progressing and if any public repo, happy to have a look

Signal feature identification

I am trying to identify phonemes in voices using a training database of known ones.
I'm wondering if there is a way of identifying common features within my training sample and using that to classify a new one.
It seems like there are two paths:
Give the process raw/normalised data and it will return similar ones
Extract certain metrics such as pitch, formants etc and compare to training set
My interest is the first!
Any recommendations on machine learning or regression methods/algorithms?

Since you tagged Python, I highly recommend looking into scikit-learn, an excellent Python library for Machine Learning. Their docs are very thorough, and should give you a good crash course in Machine Learning algorithms and implementation (including classification, regression, clustering, etc)

Your points 1 and 2 are not very different: 1) is the end result of a classification problem, and 2) is the set of features that you give to the classifier. What you need is a good classifier (SVM, decision trees, hierarchical classifiers etc.) and a good set of features (pitch, formants etc. that you mentioned).
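A minimal sketch of the "features plus classifier" route with scikit-learn, assuming each recording has already been reduced to a small feature vector. The feature values below are made-up numbers drawn from two Gaussian blobs, not real phoneme measurements, and the SVC settings are defaults rather than tuned choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.RandomState(0)

# Hypothetical feature vectors [pitch_hz, formant1_hz, formant2_hz]
# for two phoneme classes.
class_a = rng.normal(loc=[120, 300, 2300], scale=[10, 30, 100], size=(100, 3))
class_b = rng.normal(loc=[120, 700, 1200], scale=[10, 30, 100], size=(100, 3))
X = np.vstack([class_a, class_b])
y = np.array([0] * 100 + [1] * 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale the features, then fit a support vector classifier.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

The same pipeline accepts raw or normalised samples as the feature vector (the first path in the question); only the contents of X change.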

Features considered by ExtraTreeRegressor of Scikit Learn to construct Random Forest

I came across this example which involves completion of a face for the test data set. Here, a value of 32 for max_features is passed to the ExtraTreesRegressor() function. I learnt that decision trees are constructed, which select random features from the input data set. For the example from the above link, images are used as the train and test data sets. This wiki page describes various types of image features. Now I am not able to understand which features sklearn.ensemble.ExtraTreesRegressor looks for or extracts from the image data set provided as input to construct the random forest. Also, how is it determined that a value of 32 is optimum for max_features? Please help me with this.

Random forests do not do feature extraction. They use the features in the dataset given to them, which in this example are just pixel intensities from the Olivetti faces dataset.
The max_features parameter to an ExtraTreesRegressor determines "the number of features to consider when looking for the best split" (inside the decision tree learning algorithm employed by the forest).
The value 32 was probably determined empirically.
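To illustrate that the pixels themselves are the features, here is a small sketch in the spirit of the face-completion example, using random 8x8 arrays as stand-ins for the face images. The array sizes, n_estimators=10, and max_features=8 are arbitrary values chosen to keep the example fast; they are not the settings of the original example.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.RandomState(0)

# Stand-in for the face images: 60 random 8x8 "images" (hypothetical data).
images = rng.rand(60, 8, 8)
flat = images.reshape(len(images), -1)   # one row of 64 pixel values per image

X = flat[:, :32]   # upper half of each image: 32 raw pixel features
y = flat[:, 32:]   # lower half: 32 pixel values to predict (multi-output)

# At every split, each tree considers max_features randomly chosen pixels.
model = ExtraTreesRegressor(n_estimators=10, max_features=8, random_state=0)
model.fit(X, y)

predicted_lower = model.predict(X[:5])
```

No feature extraction happens anywhere in this code: each column of X is simply one pixel position.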

The features used here are the raw pixel values. As the images in the dataset are aligned and quite similar, that seems to be enough for the task.

As others said: in this naive example there is no feature extraction: the extra trees just use the raw pixels as features.
In a more realistic computer vision setting it is very likely that performing hand tuned feature extraction will lead to more interesting models. The kind of features to extract depends on the computer vision task you want to achieve. Read the literature or examples from the OpenCV library to know the state of the art in computer vision (leaving neural net-based representation learning aside as bleeding edge research for now).
The value 32 for the parameter can be tuned with a randomized search. See this example from the master branch:
http://scikit-learn.org/dev/auto_examples/randomized_search.html#example-randomized-search-py
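A minimal sketch of such a search with RandomizedSearchCV, on synthetic regression data. The data, the number of iterations, and the search range for max_features are assumptions made for this example; it is not the code from the linked page.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.RandomState(0)

# Synthetic data (hypothetical): 16 features, only the first two matter.
X = rng.rand(80, 16)
y = X[:, 0] + 0.5 * X[:, 1] + 0.05 * rng.randn(80)

search = RandomizedSearchCV(
    ExtraTreesRegressor(n_estimators=20, random_state=0),
    param_distributions={"max_features": randint(1, 17)},  # integers 1..16
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X, y)

best_max_features = search.best_params_["max_features"]
```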

Learning and using augmented Bayes classifiers in python

I'm trying to use a forest (or tree) augmented Bayes classifier (Original introduction, Learning) in python (preferably python 3, but python 2 would also be acceptable), first learning it (both structure and parameter learning) and then using it for discrete classification and obtaining probabilities for those features with missing data. (This is why just discrete classification and even good naive classifiers are not very useful for me.)
The way my data comes in, I'd love to use incremental learning from incomplete data, but I haven't even found anything doing both of these in the literature, so anything that does structure and parameter learning and inference at all is a good answer.
There seem to be a few very separate and unmaintained python packages that go roughly in this direction, but I haven't seen anything that is moderately recent (for example, I would expect that using pandas for these calculations would be reasonable, but OpenBayes barely uses numpy), and augmented classifiers seem completely absent from anything I have seen.
So, where should I look to save me some work implementing a forest augmented Bayes classifier? Is there a good implementation of Pearl's message passing algorithm in a python class, or would that be inappropriate for an augmented Bayes classifier anyway?
Is there a readable object-oriented implementation for learning and inference of TAN Bayes classifiers in some other language, which could be translated to python?
Existing packages I know of, but found inappropriate are
milk, which does support classification, but not with Bayesian classifiers (and I definitely need probabilities for the classification and unspecified features)
pebl, which only does structure learning
scikit-learn, which only learns naive Bayes classifiers
OpenBayes, which has only barely changed since somebody ported it from numarray to numpy and documentation is negligible.
libpgm, which claims to support an even different set of things. According to the main documentation, it does inference, structure and parameter learning. Except there do not seem to be any methods for exact inference.
Reverend claims to be a “Bayesian Classifier”, has negligible documentation, and from looking at the source code I am led to the conclusion that it is mostly a spam classifier, according to Robinson's and similar methods, and not a Bayesian classifier.
eBay's bayesian Belief Networks allows building generic Bayesian networks and implements inference on them (both exact and approximate), which means that it can be used to build a TAN, but there is no learning algorithm in there, and the way BNs are built from functions means implementing parameter learning is more difficult than it might be for a hypothetical different implementation.

I'm afraid there is no out-of-the-box implementation of a Random Naive Bayes classifier (not that I am aware of), because it is still an academic matter. The following paper presents a method to combine RF and NB classifiers (behind a paywall): http://link.springer.com/chapter/10.1007%2F978-3-540-74469-6_35
I think you should stick with scikit-learn, which is one of the most popular statistical modules for Python (along with NLTK) and which is really well documented.
scikit-learn has a Random Forest module: http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees . There is a submodule which may (I stress the uncertainty) be used as a pipeline stage feeding an NB classifier:
RandomTreesEmbedding implements an unsupervised transformation of the
data. Using a forest of completely random trees, RandomTreesEmbedding
encodes the data by the indices of the leaves a data point ends up in.
This index is then encoded in a one-of-K manner, leading to a high
dimensional, sparse binary coding. This coding can be computed very
efficiently and can then be used as a basis for other learning tasks.
The size and sparsity of the code can be influenced by choosing the
number of trees and the maximum depth per tree. For each tree in the
ensemble, the coding contains one entry of one. The size of the coding
is at most n_estimators * 2 ** max_depth, the maximum number of leaves
in the forest.
As neighboring data points are more likely to lie within the same leaf
of a tree, the transformation performs an implicit, non-parametric
density estimation.
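A sketch of what chaining the two might look like, assuming the sparse binary leaf coding is fed into a Bernoulli naive Bayes model. This only illustrates the pipelining idea on synthetic two-class data; it is not an implementation of the Random Naive Bayes method from the paper, and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)

# Two well-separated Gaussian blobs (hypothetical data).
X = np.vstack([
    rng.normal(-2, 1, size=(100, 2)),
    rng.normal(2, 1, size=(100, 2)),
])
y = np.array([0] * 100 + [1] * 100)

# The embedding one-hot encodes leaf indices; BernoulliNB models those bits.
pipeline = make_pipeline(
    RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0),
    BernoulliNB(),
)
pipeline.fit(X, y)

probabilities = pipeline.predict_proba(X[:3])
```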
And of course there is a out-of-core implementation of Naive Bayes classifier, which can be used incrementally : http://scikit-learn.org/stable/modules/naive_bayes.html
Discrete naive Bayes models can be used to tackle large scale text
classification problems for which the full training set might not fit
in memory. To handle this case both MultinomialNB and BernoulliNB
expose a partial_fit method that can be used incrementally as done
with other classifiers as demonstrated in Out-of-core classification
of text documents.
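A minimal sketch of incremental training with partial_fit, using synthetic count data that arrives in batches. The batch generator is made up for the example; the one real requirement is that the full list of classes is passed on the first call.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
classes = np.array([0, 1])   # all classes must be known up front
model = MultinomialNB()

for batch in range(5):
    y_batch = rng.randint(0, 2, size=40)
    # Hypothetical count features: class 0 is heavy in the first three
    # columns, class 1 in the last three.
    rates = np.where(
        y_batch[:, None] == 0,
        [5, 5, 5, 1, 1, 1],
        [1, 1, 1, 5, 5, 5],
    )
    X_batch = rng.poisson(rates)
    # Update the model with this batch only; earlier batches are not needed.
    model.partial_fit(X_batch, y_batch, classes=classes)

prediction = model.predict(np.array([
    [6, 5, 7, 0, 1, 0],
    [0, 1, 0, 6, 5, 7],
]))
```

This covers incremental parameter learning for the naive model only; it does not learn any structure between features.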

I was similarly confused as to how to do exact inference with libpgm. However, turns out it is possible. For example (from libpgm docs),
import json
from libpgm.graphskeleton import GraphSkeleton
from libpgm.nodedata import NodeData
from libpgm.discretebayesiannetwork import DiscreteBayesianNetwork
from libpgm.tablecpdfactorization import TableCPDFactorization
# load nodedata and graphskeleton
nd = NodeData()
skel = GraphSkeleton()
nd.load("../tests/unittestdict.txt")
skel.load("../tests/unittestdict.txt")
# toporder graph skeleton
skel.toporder()
# load evidence
evidence = dict(Letter='weak')
query = dict(Grade='A')
# load bayesian network
bn = DiscreteBayesianNetwork(skel, nd)
# load factorization
fn = TableCPDFactorization(bn)
# calculate probability distribution
result = fn.condprobve(query, evidence)
# output
print(json.dumps(result.vals, indent=2))
print(json.dumps(result.scope, indent=2))
print(json.dumps(result.card, indent=2))
print(json.dumps(result.stride, indent=2))
To get the example to work, here is the datafile (I replaced None with null and saved as a .json).
I know this is quite late to the game, but this was the best post I found when searching for a resource to do Bayesian networks with Python. I thought I'd answer in case anyone else is looking for this. (Sorry, would have commented, but just signed up for SO to answer this and rep isn't high enough.)

R's bnlearn has implementations for both Naive Bayes and Tree-augmented Naive Bayes classifiers. You can use rpy2 to port these to Python.
http://cran.r-project.org/web/packages/bnlearn/bnlearn.pdf

There seems to be no such thing yet.
The closest thing currently seems to be eBay's open source implementation of belief networks, bayesian. It implements inference (two exact ways, and approximate), which means that it can be used to build a TAN. An example (at the moment still an ugly piece of spaghetti code) for that can be found in my open20q repository.
Advantages:
It works.
That is, I now have an implementation of TAN inference, based on bayesian belief network inference.
With Apache 2.0 and 3-clause BSD style licenses respectively, it is legally possible to combine bayesian code and libpgm code to try to get inference and learning to work.
Disadvantages:
There is no learning whatsoever in bayesian. Trying to combine something like libpgm learning with bayesian classes and inference will be a challenge.
Even more so as bayesian assumes that nodes are given by factors which are fixed python functions. Parameter learning requires some wrapping code to enable tweaking the probabilities.
bayesian is written in pure python, using dicts etc. as basic structures, without making use of any speedup that numpy, pandas or similar packages might bring, and is therefore quite slow even for the tiny example I built.

I know it's a bit late in the day, but the Octave forge NaN package might be of interest to you. One of the classifiers in this package is an Augmented Naive Bayesian Classifier. The code is GPL'ed so you could easily port it to Python.
