I want to draw categorical vectors whose prior is a product of Dirichlet distributions. The categories are fixed, and each element of the categorical vector has its own Dirichlet prior. Here is a categorical vector of length 33 with 4 categories, where each element is given a Dirichlet prior:
import numpy as np
import pymc3 as pm

with pm.Model() as model3:
    # One Dirichlet prior per position: 33 rows of 4 category probabilities
    theta = pm.Dirichlet(name='theta', a=np.ones((33, 4)), shape=(33, 4))
    # One categorical variable per position, using the matching row of theta
    seq = [pm.Categorical(name='seq_{}'.format(i), p=theta[i, :], shape=(1,)) for i in range(33)]
    step1 = pm.Metropolis(vars=[theta])
    step2 = [pm.CategoricalGibbsMetropolis(vars=[i]) for i in seq]
    trace = pm.sample(50, step=[step1] + step2)
However, this approach is cumbersome because I have to do some array indexing to get the categorical vectors out. Are there better ways of doing this?

You don't need to specify the shape. Note that the way you've set it up there are 33 different categorical variables; I'm assuming that's what you've intended. Here's the easier way to do that:
with pm.Model() as model:
    # Note: here a single 4-category theta is shared by all 33 variables
    theta = pm.Dirichlet(name='theta', a=np.ones(4))
    children = [pm.Categorical(f"seq_{i}", p=theta) for i in range(33)]
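For reference, the generative process described in the question (an independent Dirichlet prior for each of the 33 positions, then one categorical draw per position) can be sketched without PyMC3. This is only a plain-Python illustration using the standard library, not a sampler for the posterior; the helper names `draw_dirichlet` and `draw_sequence` are made up for this sketch.

```python
import random

def draw_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalised Gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

def draw_sequence(alphas, rng):
    """Draw theta_i ~ Dirichlet(alphas[i]), then one category per position i."""
    thetas = [draw_dirichlet(alpha, rng) for alpha in alphas]
    categories = list(range(len(alphas[0])))
    seq = [rng.choices(categories, weights=theta)[0] for theta in thetas]
    return thetas, seq

rng = random.Random(0)
alphas = [[1.0] * 4 for _ in range(33)]  # flat prior, like np.ones((33, 4))
thetas, seq = draw_sequence(alphas, rng)
print(seq)  # 33 category indices in 0..3
```

Each row of `thetas` plays the role of `theta[i, :]` in the question's model, and `seq` is one draw of the whole categorical vector.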


Is it possible to fit a multivariate GMHMM in hmmlearn?

I know it is possible to fit several sequences into hmmlearn but it seems to me that these sequences need to be drawn from the same distributions.
Is it possible to fit a GMHMM with several observation sequences drawn from different distributions in hmmlearn?
My use case:
I would like to fit a GMHMM with K financial time series from different stocks and predict the market regime that generated the K stock prices at a specified time.
So the matrix input has dimension N (number of dates) × K (number of stocks).
If hmmlearn can't do that, please tell me if it is possible with another package in python or R?
Thanks for your help!
My approach to your problem will be to use a multi-variate Gaussian for emission probabilities.
For example: let's assume that K is 2, i.e., the number of locations is 2.
In hmmlearn, the K will be encoded in the dimensions of the mean matrix.
See this example: Sampling from HMM has a 2-dimensional output. In other words, X.shape = (N, K), where N is the length of the sample (500 in this case) and K is the dimension of the output (2).
Notice that the authors plotted each dimension on an axis, i.e., x-axis plots the first dimension X[:, 0], and the y-axis the second dimension X[:, 1].
To train your model, make sure that X1 and X2 are of the same shape as the sampled X in the example, and form the training dataset as described here.
In summary, adapt the example to your case by setting K to your number of series instead of K=2, and use GMMHMM (hmmlearn's class for Gaussian-mixture emissions) instead of GaussianHMM.
# Another example
import numpy as np
from hmmlearn import hmm

model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=100)
K = 3  # Number of sites
# Create a random training sequence (only 1 sequence) with length = 100.
X1 = np.random.randn(100, K)  # 100 observations for K sites
# Fit the model; the observation size K is taken from X1.shape[1]
model.fit(X1)
# Sample the fitted model
X, Z = model.sample(200)
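To train on several sequences, hmmlearn's `fit` takes one stacked array plus a `lengths` list giving the number of rows of each sequence. Here is a minimal sketch of building those two inputs, assuming two made-up sequences `X1` and `X2` with the same K columns; the helper name `stack_sequences` is hypothetical and not part of hmmlearn.

```python
import numpy as np

def stack_sequences(sequences):
    """Stack (n_i, K) arrays into one (sum n_i, K) array plus the list of n_i."""
    n_features = {seq.shape[1] for seq in sequences}
    if len(n_features) != 1:
        raise ValueError("all sequences must have the same number of columns K")
    X = np.concatenate(sequences, axis=0)
    lengths = [len(seq) for seq in sequences]
    return X, lengths

K = 3
rng = np.random.default_rng(0)
X1 = rng.standard_normal((100, K))  # made-up sequence covering 100 dates
X2 = rng.standard_normal((60, K))   # made-up sequence covering 60 dates
X, lengths = stack_sequences([X1, X2])
print(X.shape, lengths)  # (160, 3) [100, 60]
```

The resulting pair can then be passed to an hmmlearn model as `model.fit(X, lengths)`.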

Unsupervised learning clustering 1D array

I am faced with the following array:
y = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
What I would like to do is extract the cluster with the highest scores. That would be
best_cluster = [200,297,275,243]
I have checked quite a few questions on stack on this topic and most of them recommend using kmeans. Although a few others mention that kmeans might be an overkill for 1D arrays clustering.
However, kmeans is a supervised learning algorithm, which means that I would have to pass in the number of centroids. As I need to generalize this problem to other arrays, I cannot pass the number of centroids for each one of them. Therefore I am looking at implementing some sort of unsupervised learning algorithm that would be able to figure out the clusters by itself and select the highest one.
In array y I would see 3 clusters as so [1,2,4,7,9,5,4,7,9],[56,57,54,60],[200,297,275,243].
What algorithm would best fit my needs, considering computation cost and accuracy and how could I implement it for my problem?
Try MeanShift. From the sklearn user guide of MeanShift:
The algorithm automatically sets the number of clusters, ...
Modified demo code:
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
# #############################################################################
# Generate sample data
X = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
X = np.reshape(X, (-1, 1))
# #############################################################################
# Compute clustering with MeanShift
# With bandwidth=None the bandwidth is estimated automatically; to set it
# explicitly, compute it with
# bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=100)
ms = MeanShift(bandwidth=None, bin_seeding=True)
ms.fit(X)  # labels_ and cluster_centers_ only exist after fitting
labels = ms.labels_
cluster_centers = ms.cluster_centers_
labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)
print("number of estimated clusters : %d" % n_clusters_)
print(labels)
number of estimated clusters : 2
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1]
Note that MeanShift is not scalable with the number of samples. The recommended upper limit is 10,000.
BTW, as rahlf23 already mentioned, k-means is an unsupervised learning algorithm. The fact that you have to specify the number of clusters does not mean it is supervised.
See also:
Overview of clustering methods
Choosing the right estimator
Clustering is overkill here
Just compute the differences of subsequent elements. I.e. look at x[i]-x[i-1].
Choose the k largest differences as split points. Or define a threshold on when to split. E.g. 20. Depends on your data knowledge.
This is O(n), much faster than all the others mentioned. Also very understandable and predictable.
On one dimensional ordered data, any method that doesn't use the order will be slower than necessary.
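The gap-based idea can be sketched in plain Python as follows. This is only an illustration under the assumption that the number of split points `k` is known (here 2, matching the three clusters the question expects); the function name `split_by_largest_gaps` is made up. The values are sorted first, so this version costs O(n log n) unless the data is already ordered.

```python
import heapq

def split_by_largest_gaps(values, k):
    """Split values into k + 1 groups at the k largest gaps between sorted neighbours."""
    ordered = sorted(values)
    # gap i sits between ordered[i] and ordered[i + 1]
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    cut_after = sorted(i for _, i in heapq.nlargest(k, gaps))
    groups, start = [], 0
    for i in cut_after:
        groups.append(ordered[start:i + 1])
        start = i + 1
    groups.append(ordered[start:])
    return groups

y = [1, 2, 4, 7, 9, 5, 4, 7, 9, 56, 57, 54, 60, 200, 297, 275, 243]
groups = split_by_largest_gaps(y, k=2)
best_cluster = groups[-1]  # the group holding the largest values
print(best_cluster)  # [200, 243, 275, 297]
```

On this particular array, a fixed threshold of 20 would also separate 200, 243, 275 and 297 from each other, because their sorted gaps are 43, 32 and 22, so the threshold has to be chosen with the data in mind.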
HDBSCAN is the best clustering algorithm and you should always use it.
Basically all you need to do is provide a reasonable min_cluster_size, a valid distance metric and you're good to go.
For min_cluster_size I suggest using 3 since a cluster of 2 is lame and for metric the default euclidean works great so you don't even need to mention it.
Don't forget that distance metrics apply to vectors and here we have scalars so some ugly reshaping is in order.
To put it all together and assuming by "cluster with the highest scores" you mean the cluster that includes the max value we get:
from hdbscan import HDBSCAN
import numpy as np
y = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
y = np.reshape(y, (-1, 1))
clusterer = HDBSCAN(min_cluster_size=3)
cluster_labels = clusterer.fit_predict(y)
best_cluster = clusterer.exemplars_[cluster_labels[y.argmax()]].ravel()
The output is [297 200 275 243]. Original order is not preserved. C'est la vie.

How can I fit a GMM to a 1D Gaussian plot with sklearn?

I realize there are several articles that demonstrate how to fit a GMM to a 1D Gaussian with sklearn ([1] and [2], to name a few). However, in all of those cases, the data is present as single points where the distribution is Gaussian. In my case, I essentially have a frequency table (I'm working with spectroscopic data), where the distribution is Gaussian, but the individual points are unknown.
My distribution (i.e., the data I'm trying to fit) looks like this: 1D Gaussian Peak
I'd like to use GMM to deconvolve the 2 initial Gaussian distributions that make up this peak.
So far, I've tried the following (assume my data is a 200x2 array, with position in one column and AFU on the second) :
import numpy as np
from sklearn import mixture
import matplotlib.pyplot as plt
def gengmm(nc=4, n_iter=2):
    g = mixture.GMM(n_components=nc)  # number of components
    g.init_params = ""  # No initialization
    g.n_iter = n_iter  # iteration of EM method
    return g
I tried to see if I could fit this peak to just a single Gaussian:
g = gengmm(1, 100)
However, the mean and covariance I get don't define my data particularly well (notably, the mean for that Gaussian distribution is 127.5, which is not what is recovered with a 1 component GMM).
Is there an easier way to do this? (I realize I can just use a least-squares fit to recover the initial Gaussian, but again, I'm trying to ultimately use this to determine the two underlying Gaussian distributions that make up the final one.)
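One way to read such a frequency table, sketched here under the assumption that the AFU column can be treated as a non-negative weight for each position: for a single Gaussian, the fitted mean and variance are the intensity-weighted mean and variance of the positions. The synthetic peak below (centre 127.5, width 10, 200 positions) is invented for the illustration and is not the asker's data, and `weighted_gaussian_fit` is a made-up helper name.

```python
import math

def weighted_gaussian_fit(positions, intensities):
    """Return the intensity-weighted mean and standard deviation of the positions."""
    total = sum(intensities)
    mean = sum(x * w for x, w in zip(positions, intensities)) / total
    var = sum(w * (x - mean) ** 2 for x, w in zip(positions, intensities)) / total
    return mean, math.sqrt(var)

# Synthetic frequency table: one row per position, intensity follows a Gaussian peak
positions = list(range(200))
intensities = [math.exp(-0.5 * ((x - 127.5) / 10.0) ** 2) for x in positions]

mean, std = weighted_gaussian_fit(positions, intensities)
print(round(mean, 3), round(std, 3))
```

Passing the raw 200x2 array to a mixture model instead treats every (position, AFU) row as a 2-D sample, which is a different problem from fitting the peak shape.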

Hierarchical clustering for categorical data in python

I have categorical attributes that contain string values. Three of them contain the day name (mon to sun), the month name, and the time interval (morning, afternoon, evening); two others contain district and street names. These are followed by gender, role, comments (a predefined fixed field with values such as good, bad, strong agree, etc.), surname, and first name. My intention is to cluster them and visualize the result. I applied k-means clustering using WEKA, but it did not work.
Now I wish to apply hierarchical clustering on it. I found this code:
import numpy as np
import scipy.cluster.hierarchy as sch
X = np.random.randn(100, 2) # 100 2-dimensional observations
d = sch.distance.pdist(X) # vector of (100 choose 2) pairwise distances
L = sch.linkage(d, method='complete')
ind = sch.fcluster(L, 0.5*d.max(), 'distance')
However, X in above code is numeric; I have categorical data.
Is there some way that I can use a NumPy array of categorical data to find the distance?
In other words can I use categorical data of string values to find the distance?
I would then use that distance in sch.linkage(d, method='complete')
I think we've identified the problem, then: you leave the X values as they are, string data. You can pass those to pdist, but you also have to supply a 2-arity function (2 inputs, numeric output) for the distance metric.
The simplest one would be that equal classifications have 0 distance; everything else is 1. You can do this with
d = sch.distance.pdist(X, lambda u, v: u != v)
If you have other class discrimination in mind, just code logic to return the desired distance, wrap it in a function, and then pass the function name to pdist. We can't help with that, because you've told us nothing about your classes or the model semantics.
Does that get you moving?
Another possibility is the use of the Hamming distance.
Y = pdist(X, 'hamming')
Computes the normalized Hamming distance, or the proportion of those
vector elements between two n-vectors u and v which disagree. To save
memory, the matrix X can be of type boolean.
If your categorical data is represented by a single character e.g.: "m"/"f" it could be what you are looking for.
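Both suggestions above amount to counting the attributes on which two records disagree. Here is a small plain-Python sketch of that idea, using three invented records; the helper names `mismatch_fraction` and `condensed_distances` are hypothetical. The distances are listed in the pair order (0,1), (0,2), ..., (1,2), ..., which is the condensed layout that pdist produces.

```python
from itertools import combinations

def mismatch_fraction(u, v):
    """Fraction of attributes on which two equally long records disagree."""
    if len(u) != len(v):
        raise ValueError("records must have the same number of attributes")
    return sum(a != b for a, b in zip(u, v)) / len(u)

def condensed_distances(records):
    """Pairwise distances for every pair (i, j) with i < j, in row-major order."""
    return [mismatch_fraction(u, v) for u, v in combinations(records, 2)]

# Invented records: (day name, month name, time interval, gender)
records = [
    ("mon", "jan", "morning", "f"),
    ("mon", "jan", "evening", "f"),
    ("sun", "jul", "evening", "m"),
]
d = condensed_distances(records)
print(d)  # [0.25, 1.0, 0.75]
```

A vector like `d` should then be usable as the first argument of sch.linkage, in place of the output of pdist.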

Python Clustering 'purity' metric

I'm using a Gaussian Mixture Model (GMM) from sklearn.mixture to perform clustering of my data set.
I could use the function score() to compute the log probability under the model.
However, I am looking for a metric called 'purity' which is defined in this article.
How can I implement it in Python? My current implementation looks like this:
import numpy as np
from sklearn.mixture import GMM
# X is a 1000 x 2 array (1000 samples of 2 coordinates).
# It is actually a 2 dimensional PCA projection of data
# extracted from the MNIST dataset, but this random array
# is equivalent as far as the code is concerned.
X = np.random.rand(1000, 2)
clusterer = GMM(3, 'diag')
clusterer.fit(X)  # the model has to be fitted before predicting
cluster_labels = clusterer.predict(X)
# Now I can count the labels for each cluster..
count0 = list(cluster_labels).count(0)
count1 = list(cluster_labels).count(1)
count2 = list(cluster_labels).count(2)
But I cannot loop through each cluster in order to compute the confusion matrix (according to this question).
David's answer works but here is another way to do it.
import numpy as np
from sklearn import metrics
def purity_score(y_true, y_pred):
    # compute contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # return purity
    return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix)
Also if you need to compute Inverse Purity, all you need to do is replace "axis=0" by "axis=1".
sklearn doesn't implement a cluster purity metric. You have 2 options:
Implement the measurement using sklearn data structures yourself. This and this have some python source for measuring purity, but either your data or the function bodies need to be adapted for compatibility with each other.
Use the (much less mature) PML library, which does implement cluster purity.
A very late contribution.
You can try to implement it like this, pretty much like in this gist
import numpy as np
from sklearn.metrics import accuracy_score

def purity_score(y_true, y_pred):
    """Purity score

    Args:
        y_true(np.ndarray): n*1 matrix Ground truth labels
        y_pred(np.ndarray): n*1 matrix Predicted clusters

    Returns:
        float: Purity score
    """
    # work on a copy so the caller's labels are not overwritten
    y_true = y_true.copy()
    # matrix which will hold the majority-voted labels
    y_voted_labels = np.zeros(y_true.shape)
    # Ordering labels
    ## Labels might be missing e.g with set like 0,2 where 1 is missing
    ## First find the unique labels, then map the labels to an ordered set
    ## 0,2 should become 0,1
    labels = np.unique(y_true)
    ordered_labels = np.arange(labels.shape[0])
    for k in range(labels.shape[0]):
        y_true[y_true==labels[k]] = ordered_labels[k]
    # Update unique labels
    labels = np.unique(y_true)
    # We use n_classes+1 bin edges so that each bin counts
    # the actual occurrence of one class between two consecutive edges,
    # the bigger being excluded [bin_i, bin_i+1[
    bins = np.concatenate((labels, [np.max(labels)+1]), axis=0)
    for cluster in np.unique(y_pred):
        hist, _ = np.histogram(y_true[y_pred==cluster], bins=bins)
        # Find the most present label in the cluster
        winner = np.argmax(hist)
        y_voted_labels[y_pred==cluster] = winner
    return accuracy_score(y_true, y_voted_labels)
The currently top voted answer correctly implements the purity metric, but may not be the most appropriate metric in all cases, because it does not ensure that each predicted cluster label is assigned only once to a true label.
For example, consider a dataset that is very imbalanced, with 99 examples of one label and 1 example of another label. Then any clustering (e.g: having two equal clusters of size 50) will achieve purity of at least 0.99, rendering it a useless metric.
Instead, in cases where the number of clusters is the same as the number of labels, cluster accuracy may be more appropriate. This has the advantage of mirroring classification accuracy in an unsupervised setting. To compute cluster accuracy, we need to use the Hungarian algorithm to find the optimal matching between cluster labels and true labels. The SciPy function linear_sum_assignment does this:
import numpy as np
from sklearn import metrics
from scipy.optimize import linear_sum_assignment
def cluster_accuracy(y_true, y_pred):
    # compute contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # Find optimal one-to-one mapping between cluster labels and true labels
    row_ind, col_ind = linear_sum_assignment(-contingency_matrix)
    # Return cluster accuracy
    return contingency_matrix[row_ind, col_ind].sum() / np.sum(contingency_matrix)
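The imbalanced example from this answer (99 samples of one label, 1 of the other, split into two clusters of 50) can be checked by hand with a small standard-library sketch. The function names `purity` and `cluster_accuracy_bruteforce` are made up for this illustration; the brute-force matching tries every one-to-one mapping, so it is only meant for a handful of clusters and assumes there are no more clusters than labels.

```python
from collections import Counter
from itertools import permutations

def purity(y_true, y_pred):
    """Share of samples that carry the majority true label of their cluster."""
    members = {}
    for label, cluster in zip(y_true, y_pred):
        members.setdefault(cluster, []).append(label)
    majority_total = sum(Counter(labels).most_common(1)[0][1] for labels in members.values())
    return majority_total / len(y_true)

def cluster_accuracy_bruteforce(y_true, y_pred):
    """Best accuracy over all one-to-one cluster-to-label mappings."""
    labels = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    counts = Counter(zip(y_true, y_pred))
    best = 0
    for perm in permutations(labels, len(clusters)):
        matched = sum(counts[(label, cluster)] for label, cluster in zip(perm, clusters))
        best = max(best, matched)
    return best / len(y_true)

y_true = [0] * 99 + [1]        # 99 samples of one label, 1 of the other
y_pred = [0] * 50 + [1] * 50   # two equal clusters, unrelated to the labels
print(purity(y_true, y_pred))                       # 0.99
print(cluster_accuracy_bruteforce(y_true, y_pred))  # 0.51
```

Purity rewards both clusters for containing mostly the majority label, while the one-to-one matching can credit that label to only one of them.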
