Matplotlib: How to convert a histogram to a discrete probability mass function? - python

I have a question regarding the hist() function with matplotlib.
I am writing code to plot a histogram of data whose values vary from 0 to 1. For example:
import numpy as np
import matplotlib.pyplot as plt

values = [0.21, 0.51, 0.41, 0.21, 0.81, 0.99]
bins = np.arange(0, 1.1, 0.1)
a, b, c = plt.hist(values, bins=bins, normed=0)
plt.show()
The code above generates a correct histogram (I could not post an image since I do not have enough reputation). In terms of frequencies, it looks like:
[0 0 2 0 1 1 0 0 1 1]
I would like to convert this output to a discrete probability mass function, i.e. for the above example, I would like to get a following frequency values:
[ 0. 0. 0.333333333 0. 0.166666667 0.166666667 0. 0. 0.166666667 0.166666667 ]  # each item in the previous array divided by 6
I thought I simply needed to change the parameter in the hist() function to normed=1. However, I then get the following histogram frequencies:
[ 0. 0. 3.33333333 0. 1.66666667 1.66666667 0. 0. 1.66666667 1.66666667 ]
This is not what I expect, and I don't know how to get a discrete probability mass function whose sum is 1.0. A similar question was asked in the following link (link to the question), but I do not think it was resolved.
I appreciate your help in advance.

The reason is that normed=1 gives the probability density function. In probability theory, the probability density function of a continuous random variable describes the relative likelihood of that random variable taking on a given value.
Let us consider a very simple example.
x=np.arange(0.1,1.1,0.1)
array([ 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
# Bin size
bins = np.arange(0.05, 1.15, 0.1)
np.histogram(x,bins=bins,normed=1)[0]
[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]
np.histogram(x,bins=bins,normed=0)[0]/float(len(x))
[ 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
# Change the bin size
bins = np.arange(0.05, 1.15, 0.2)
np.histogram(x,bins=bins,normed=1)[0]
[ 1., 1., 1., 1., 1.]
np.histogram(x,bins=bins,normed=0)[0]/float(len(x))
[ 0.2, 0.2, 0.2, 0.2, 0.2]
As you can see above, the probability that x lies in [0.05, 0.15] or in [0.15, 0.25] is 1/10, whereas if you change the bin size to 0.2, the probability that it lies in [0.05, 0.25] or in [0.25, 0.45] is 1/5. The actual probability values therefore depend on the bin size, while the probability density does not. That is why the density is the proper normalization here; otherwise one would need to state the bin width with every plot.
So in your case, if you really want to plot the probability value in each bin (and not the probability density), you can simply divide the frequency of each histogram bin by the total number of elements. However, I would suggest not doing this unless you are working with discrete variables and each of your bins represents a single possible value of that variable.
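As a minimal sketch of that division (my own example, using np.histogram and the values/bins from the question; counts and pmf are just illustrative names):
import numpy as np
import matplotlib.pyplot as plt

values = [0.21, 0.51, 0.41, 0.21, 0.81, 0.99]
bins = np.arange(0, 1.1, 0.1)

counts, edges = np.histogram(values, bins=bins)  # raw frequencies: [0 0 2 0 1 1 0 0 1 1]
pmf = counts / counts.sum()                      # divide by the total count, so pmf sums to 1.0

plt.bar(edges[:-1], pmf, width=np.diff(edges), align='edge')
plt.show()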

Plotting a continuous probability density function (PDF) from a histogram, solved in Python. Refer to this blog for a detailed explanation: http://howdoudoittheeasiestway.blogspot.com/2017/09/plotting-continuous-probability.html
Otherwise, you can use the code below.
# A is assumed to be a 1-D array of data values.
n, bins, patches = plt.hist(A, 40, histtype='bar')  # 40-bin histogram of A
plt.show()
n = n / len(A)                 # convert counts to per-bin probabilities
n = np.append(n, 0)            # pad so n has the same length as bins
mu = np.mean(n)                # mean of the per-bin probabilities
sigma = np.std(n)              # standard deviation of the per-bin probabilities
plt.bar(bins, n, width=(bins[-1] - bins[0]) / 40)
# Scaled Gaussian curve overlaid on the bar plot (the 0.03 factor is an ad-hoc scaling)
y1 = (1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins - mu)**2 / (2 * sigma**2))) * 0.03
plt.plot(bins, y1, 'r--', linewidth=2)
plt.show()
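The snippet above assumes that numpy (as np), matplotlib.pyplot (as plt), and a 1-D data array A are already defined; a hypothetical setup could be:
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data: 1000 samples from a standard normal distribution.
A = np.random.normal(loc=0.0, scale=1.0, size=1000)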

Related

Vectorizing a Numpy operation on differently shaped matrices

I just cannot find a solution to the following problem:
Consider two NumPy arrays, one of shape (10, 64, 10) and one of shape (x, 64).
Array A (10, 64, 10) represents 10 classes with 64 features, and for each feature I have a PDF split into 10 bins --> (classes, features, bins). Each value in the innermost array represents a probability.
[[[0.62, 0.  , 0.  ],
  [0.12, 0.09, 0.01],
  [0.59, 0.01, 0.  ],
  [0.62, 0.  , 0.  ]],

 [[0.62, 0.  , 0.  ],
  [0.59, 0.01, 0.  ],
  [0.62, 0.  , 0.  ],
  [0.62, 0.  , 0.  ]]]
(simplified to (2, 4, 3) so you can test it by copying it directly; the represented classes are "0" and "1")
Array B (X, 64) holds the instances of a dataset: for each instance, entry i is the index of the bin that the i-th feature falls into:
[[0, 0, 2, 1],
 [0, 0, 1, 0],
 [0, 2, 1, 0]]
(simplified to (X=3, 4))
What I want to do is, for each row in Array B, e.g. [0, 0, 2, 1], look up in Array A, for every feature, the probability stored at that feature's bin, both for class "0" and for class "1".
The expected output for the first instance here would be:
"0" = [0.62, 0.12, 0.00, 0.00]
"1" = [0.62, 0.59, 0.00, 0.00]
and if possible then for all X instances.
I do not expect a dictionary or anything like that, just some array that contains the values in a somewhat sorted manner (the ordering can also differ from the one shown in the example).
Of course, I could do all this in giant nested for-loops, but I want at least some vectorization. Has anybody any good suggestions? The answer does not have to be a full-fledged solution.
EDIT:
The best nested loop I came up with was
prediction = np.empty((bins.shape[0], histograms.shape[1], histograms.shape[0]))
for n, instance in enumerate(bins):
for i, instance_bin in enumerate(instance):
prediction[n,i] = histograms[:,i, instance_bin] # Prob for every class x that the bin given in "instance_bin" of feature "i" corresponds to a possible instance of that class
histograms = Array A;
bins = Array B
Please also point out any other bad practices you spot in the way I'm working with NumPy, or anything else in this snippet.
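For what it's worth, here is a minimal runnable sketch of one possible vectorization via NumPy fancy indexing, using the simplified (2, 4, 3) and (3, 4) arrays from above (variable names follow the question; this is just one way to do it):
import numpy as np

# Array A: (classes, features, bins) probabilities, simplified to (2, 4, 3)
histograms = np.array([[[0.62, 0.00, 0.00],
                        [0.12, 0.09, 0.01],
                        [0.59, 0.01, 0.00],
                        [0.62, 0.00, 0.00]],
                       [[0.62, 0.00, 0.00],
                        [0.59, 0.01, 0.00],
                        [0.62, 0.00, 0.00],
                        [0.62, 0.00, 0.00]]])

# Array B: (instances, features) bin indices, simplified to (3, 4)
bins = np.array([[0, 0, 2, 1],
                 [0, 0, 1, 0],
                 [0, 2, 1, 0]])

n_features = histograms.shape[1]
# Fancy indexing: for every class, pick feature i's probability at bin bins[n, i].
# The result has shape (classes, instances, features); moving the class axis last
# gives (instances, features, classes), matching the nested-loop version.
prediction = histograms[:, np.arange(n_features), bins].transpose(1, 2, 0)

print(prediction[0, :, 0])   # class "0" probabilities for the first instance: 0.62, 0.12, 0.0, 0.0
print(prediction[0, :, 1])   # class "1" probabilities for the first instance: 0.62, 0.59, 0.0, 0.0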

How can I avoid tetrahedra lying in a plane using scipy.Delaunay?

I'm trying to create a tetrahedral mesh in Python 3.7.3, but some tetrahedra are flat, i.e. their vertices lie in a common plane.
import numpy as np
from scipy.spatial import Delaunay
# Coordinates of 3x3x3 equally spaced points in the unit cube
x = np.linspace(0, 1, 3)
X1, X2, X3 = np.meshgrid(x,x,x)
vertices = np.hstack([X1.reshape(-1,1),X2.reshape(-1,1),X3.reshape(-1,1)])
# Using Delaunay
tri = Delaunay(vertices).simplices
# tetrahedra
simplices = vertices[tri,:]
As you can see below, already for the third tetrahedron all y-coordinates are 0.5. This later leads to a singular matrix.
print(simplices[0:3])
[[[1.  0.5 1. ]
  [0.5 1.  0.5]
  [1.  0.5 0.5]
  [0.5 0.5 0.5]]

 [[1.  0.5 1. ]
  [1.  1.  0.5]
  [0.5 1.  0.5]
  [1.  0.5 0.5]]

 [[1.  0.5 1. ]
  [0.5 0.5 1. ]
  [1.  0.5 0.5]
  [0.5 0.5 0.5]]]
Do you know how I can work around this problem? Thank you very much.
Unlike in 2D, for 3D there is no known algorithm that definitely generates a Delaunay triangulation of a given domain. There are, however, some mesh generation packages which produce pretty good tetrahedral meshes. For example:
gmsh/pygmsh
pygalmesh
mshr
...
(Disclaimer: I'm the author of pygmsh and pygalmesh.)
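If you want to stay with scipy.spatial.Delaunay, one pragmatic workaround (not part of the answer above, just a sketch) is to compute each tetrahedron's volume and drop the near-degenerate cells; whether the remaining mesh is still suitable depends on your application:
import numpy as np
from scipy.spatial import Delaunay

x = np.linspace(0, 1, 3)
X1, X2, X3 = np.meshgrid(x, x, x)
vertices = np.hstack([X1.reshape(-1, 1), X2.reshape(-1, 1), X3.reshape(-1, 1)])

tri = Delaunay(vertices).simplices
simplices = vertices[tri, :]                       # shape (n_tets, 4, 3)

# Volume of each tetrahedron: |det of the three edge vectors| / 6
edges = simplices[:, 1:, :] - simplices[:, :1, :]  # shape (n_tets, 3, 3)
volumes = np.abs(np.linalg.det(edges)) / 6.0

keep = volumes > 1e-12                             # tolerance chosen arbitrarily
tri = tri[keep]
print("kept", keep.sum(), "of", len(keep), "tetrahedra")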

norm parameters in sklearn.preprocessing.normalize

The sklearn documentation says "norm" can be either
norm : ‘l1’, ‘l2’, or ‘max’, optional (‘l2’ by default)
The norm to use to normalize each non zero sample (or each non-zero feature if axis is 0).
The documentation about normalization doesn't clearly state how ‘l1’, ‘l2’, or ‘max’ are calculated.
Can anyone clarify this?
Informally speaking, the norm is a generalization of the concept of (vector) length; from the Wikipedia entry:
In linear algebra, functional analysis, and related areas of mathematics, a norm is a function that assigns a strictly positive length or size to each vector in a vector space.
The L2-norm is the usual Euclidean length, i.e. the square root of the sum of the squared vector elements.
The L1-norm is the sum of the absolute values of the vector elements.
The max-norm (sometimes also called infinity norm) is simply the maximum absolute vector element.
As the docs say, normalization here means making our vectors (i.e. data samples) have unit length, so specifying which length (i.e. which norm) to use is also required.
You can easily verify the above adapting the examples from the docs:
from sklearn import preprocessing
import numpy as np
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
X_l1 = preprocessing.normalize(X, norm='l1')
X_l1
# array([[ 0.25, -0.25,  0.5 ],
#        [ 1.  ,  0.  ,  0.  ],
#        [ 0.  ,  0.5 , -0.5 ]])
You can verify by simple visual inspection that the absolute values of the elements of X_l1 sum up to 1.
X_l2 = preprocessing.normalize(X, norm='l2')
X_l2
# array([[ 0.40824829, -0.40824829,  0.81649658],
#        [ 1.        ,  0.        ,  0.        ],
#        [ 0.        ,  0.70710678, -0.70710678]])
np.sqrt(np.sum(X_l2**2, axis=1)) # verify that L2-norm is indeed 1
# array([ 1., 1., 1.])
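The 'max' norm can be checked in the same way (this just extends the docs example above; the commented output is hand-computed, so treat it as illustrative):
X_max = preprocessing.normalize(X, norm='max')
X_max
# array([[ 0.5, -0.5,  1. ],
#        [ 1. ,  0. ,  0. ],
#        [ 0. ,  1. , -1. ]])
np.max(np.abs(X_max), axis=1)  # verify that the largest absolute element is 1 in each row
# array([ 1.,  1.,  1.])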

Different results from scipy.stats.spearmanr depending on how data is produced

I'm having a weird problem with spearmanr from scipy.stats. I'm using the values of a polynomial to get correlations that are a bit more interesting to work with, but if I manually enter the values (as a list converted to a NumPy array) I get a different correlation from the one I get if I calculate the values using a function. The code below should demonstrate what I mean:
import numpy as np
from scipy.stats import spearmanr
data = np.array([ 0.4, 1.2, 1. , 0.4, 0. , 0.4, 2.2, 6. , 12.4, 22. ])
axis = np.arange(0, 10, dtype=np.float64)
print(spearmanr(axis, data))# gives a correlation of 0.693...
# Use this polynomial
poly = lambda x: 0.1*(x - 3.0)**3 + 0.1*(x - 1.0)**2 - x + 3.0
data2 = poly(axis)
print(data2) # It is the same as data
print(spearmanr(axis, data2))# gives a correlation of 0.729...
I did notice that the arrays are subtly different (i.e. data - data2 is not exactly zero for all elements), but the difference is tiny - order of 1e-16.
Is such a tiny difference enough to throw off spearmanr by this much?
Is such a tiny difference enough to throw off spearmanr by this much?
Yes, because Spearman's r is based on the sample rank. Such tiny differences can change the rank of values that would otherwise be equal:
sp.stats.rankdata(data)
# array([ 3., 6., 5., 3., 1., 3., 7., 8., 9., 10.])
# Note that all three values of 0.4 get the same rank 3.
sp.stats.rankdata(data2)
# array([ 2.5, 6. , 5. , 2.5, 1. , 4. , 7. , 8. , 9. , 10. ])
# Note that two values 0.4 get the rank 2.5 and one gets 4.
If you add a small gradient (larger than the numerical difference you observe) to break such ties, you will get the same result:
print(spearmanr(axis, data + np.arange(10)*1e-12))
# SpearmanrResult(correlation=0.74545454545454537, pvalue=0.013330146315440047)
print(spearmanr(axis, data2 + np.arange(10)*1e-12))
# SpearmanrResult(correlation=0.74545454545454537, pvalue=0.013330146315440047)
This, however, will break any ties that may be intentional and can lead to over- or underestimating the correlation. numpy.round may be the preferable solution if the data is expected to have discrete values.
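A minimal sketch of that rounding approach (8 decimals is an arbitrary choice; it just needs to be coarse enough to absorb the ~1e-16 noise):
data2_rounded = np.round(data2, 8)      # collapses the ~1e-16 differences back into exact ties
print(spearmanr(axis, data2_rounded))
# should now reproduce the 0.693... correlation obtained for the hand-entered data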

affinity propagation in python

I am seeing something strange while using AffinityPropagation from sklearn. I have a 4 x 4 NumPy ndarray, which is basically the affinity scores: sim[i, j] holds the affinity score of (i, j). Now, when I feed it into the AffinityPropagation function, I get a total of 4 labels.
Here is a similar example with a smaller matrix:
In [215]: x = np.array([[1, 0.2, 0.4, 0], [0.2, 1, 0.8, 0.3], [0.4, 0.8, 1, 0.7], [0, 0.3, 0.7, 1]])
In [216]: x
Out[216]:
array([[ 1. ,  0.2,  0.4,  0. ],
       [ 0.2,  1. ,  0.8,  0.3],
       [ 0.4,  0.8,  1. ,  0.7],
       [ 0. ,  0.3,  0.7,  1. ]])
In [217]: clusterer = cluster.AffinityPropagation(affinity='precomputed')
In [218]: f = clusterer.fit(x)
In [219]: f.labels_
Out[219]: array([0, 1, 1, 1])
This says (according to Kevin) that the first sample (0th-indexed row) is a cluster (cluster #0) on its own and that the rest of the samples are in another cluster (cluster #1). But I still do not understand this output. What is a sample here? What are the members? I want one set of pairs (i, j) assigned to one cluster, another set of pairs assigned to another cluster, and so on.
It looks like a 4-sample x 4-feature matrix, which is not what I want. Is this the problem? If so, how do I convert it to a proper 4-sample x 4-sample affinity matrix?
The documentation (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html) says
fit(X, y=None)
Create affinity matrix from negative euclidean distances, then apply affinity propagation clustering.
Parameters:
X: array-like, shape (n_samples, n_features) or (n_samples, n_samples) :
Data matrix or, if affinity is precomputed, matrix of similarities / affinities.
Thanks!
From your description it sounds like you are working with a "pairwise similarity matrix": x (although your example data does not show that). If this is the case, your matrix should be symmetric, so that sim[i,j] == sim[j,i], with the diagonal values equal to 1. Example similarity data S:
S
array([[ 1.        ,  0.08276253,  0.16227766,  0.47213595,  0.64575131],
       [ 0.08276253,  1.        ,  0.56776436,  0.74456265,  0.09901951],
       [ 0.16227766,  0.56776436,  1.        ,  0.47722558,  0.58257569],
       [ 0.47213595,  0.74456265,  0.47722558,  1.        ,  0.87298335],
       [ 0.64575131,  0.09901951,  0.58257569,  0.87298335,  1.        ]])
Typically, when you already have a distance matrix, you should use affinity='precomputed'. But in your case you are using similarities. In this specific example you can transform to a pseudo-distance using 1 - S (the reason to do this is that I don't know whether Affinity Propagation will give you the expected results if you give it a similarity matrix as input):
1 - S
array([[ 0.        ,  0.91723747,  0.83772234,  0.52786405,  0.35424869],
       [ 0.91723747,  0.        ,  0.43223564,  0.25543735,  0.90098049],
       [ 0.83772234,  0.43223564,  0.        ,  0.52277442,  0.41742431],
       [ 0.52786405,  0.25543735,  0.52277442,  0.        ,  0.12701665],
       [ 0.35424869,  0.90098049,  0.41742431,  0.12701665,  0.        ]])
With that being said, I think this is where your interpretation was off:
This says that the first 3-rows are similar, 4th row is a cluster on its own, and the 5th row is also a cluster on its own. Totally of 3 clusters.
The f.labels_ array:
array([0, 1, 1, 1, 0])
is telling you that samples (not rows) 0 and 4 are in cluster 0, and that samples 1, 2, and 3 are in cluster 1. You don't need 25 different labels for a 5-sample problem; that wouldn't make sense. Hope this helps a little. Try the demo (inspect the variables along the way and compare them with your data), which starts with raw data; it should help you decide whether Affinity Propagation is the right clustering algorithm for you.
According to this page (https://scikit-learn.org/stable/modules/clustering.html), you can use a similarity matrix for AffinityPropagation.
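A minimal sketch of that usage, reusing the 4 x 4 matrix from the question as a precomputed similarity (the exact cluster assignment depends on the data and on AffinityPropagation's defaults, e.g. its preference parameter):
import numpy as np
from sklearn.cluster import AffinityPropagation

S = np.array([[1. , 0.2, 0.4, 0. ],
              [0.2, 1. , 0.8, 0.3],
              [0.4, 0.8, 1. , 0.7],
              [0. , 0.3, 0.7, 1. ]])

labels = AffinityPropagation(affinity='precomputed').fit(S).labels_
# One label per sample (row); group the sample indices by label to see each cluster's members.
for k in np.unique(labels):
    print("cluster", k, "-> samples", np.where(labels == k)[0])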
