Numpy cumulative distribution function (CDF) - python

I have an array of values and have created a histogram of the data using numpy.histogram, as follows:
histo = numpy.histogram(arr, nbins)
where nbins is the number of bins derived from the range of the data (max-min) divided by a desired bin width.
From the output I create a cumulative distribution function using:
cdf = numpy.cumsum(histo[0])
normCdf = cdf / numpy.amax(cdf)
However, I need an array of normCdf values that corresponds to the values in the original array (arr). For example, if a value in the original array arr is near the minimum of arr, then its corresponding normCdf value will be high (e.g. 0.95). (In this example I am working with radar data, so my values are in decibels and are negative; the lowest value is therefore where the CDF reaches its maximum.)
I'm struggling, conceptually, with how to produce an array in which each value has its corresponding value under the CDF (its normCdf value). Any help would be appreciated. The histogram with the CDF is below.

This is old, but may still be of help to someone.
Consider the OP's last sentence:
I'm struggling, conceptually, with how to produce an array in which each value has its corresponding value under the CDF (its normCdf value).
If I understand correctly, what the OP is asking for actually boils down to the (normalized) ordinal rank of the array elements.
The ordinal rank of an array element i basically indicates how many elements in the array have a value smaller than that of element i. This is equivalent to the discrete cumulative density.
Ordinal ranking is related to sorting by the following equality (where u is an unsorted list):
u == [sorted(u)[i] for i in ordinal_rank(u)]
Based on the implementation of scipy.stats.rankdata, the ordinal rank can be computed as follows:
def ordinal_rank(data):
    # Invert the argsort permutation: rank[i] is the position of
    # data[i] in the sorted order (0 = smallest).
    rank = numpy.empty(data.size, dtype=int)
    rank[numpy.argsort(data)] = numpy.arange(data.size)
    return rank
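As a quick sanity check (a minimal sketch, assuming a small random test array), the ranks reproduce the sorting identity quoted above:
import numpy

u = numpy.random.default_rng(0).normal(size=10)
rank = ordinal_rank(u)
# Each element's rank is its position in the sorted array, so indexing
# the sorted array by the ranks recovers the unsorted input.
assert numpy.array_equal(u, numpy.sort(u)[rank])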
So, to answer the OP's question:
The normalized (empirical) cumulative density corresponding to the values in the OP's arr can then be computed as follows:
normalized_cdf = ordinal_rank(arr) / len(arr)
And the result can be displayed using:
pyplot.plot(arr, normalized_cdf, marker='.', linestyle='')
Note, that, if you only need the plot, there is an easier way:
n = len(arr)
pyplot.plot(numpy.sort(arr), numpy.arange(n) / n)
And, finally, we can verify this by plotting the cumulative normalized histogram as follows (using an arbitrary number of bins):
pyplot.hist(arr, bins=100, cumulative=True, density=True)
Here's an example comparing the three approaches, using 30 bins for the cumulative histogram:
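A minimal self-contained sketch of such a comparison (assuming normally distributed sample data and the ordinal_rank helper from above):
import numpy
from matplotlib import pyplot

arr = numpy.random.default_rng(0).normal(size=1000)
n = len(arr)

# 1. Rank-based empirical CDF, evaluated at each sample point.
pyplot.plot(arr, ordinal_rank(arr) / n, marker='.', linestyle='', label='ordinal rank')
# 2. Sorted-sample empirical CDF.
pyplot.plot(numpy.sort(arr), numpy.arange(n) / n, label='sorted samples')
# 3. Cumulative normalized histogram with 30 bins.
pyplot.hist(arr, bins=30, cumulative=True, density=True, histtype='step', label='cumulative histogram')

pyplot.legend()
pyplot.show()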

Related

scipy.stats.wasserstein_distance implementation

I am trying to understand the implementation used in scipy.stats.wasserstein_distance. For p = 1 and no weights, with u_values and v_values the two 1-D distributions, the code comes down to:
u_sorter = np.argsort(u_values)  # (1)
v_sorter = np.argsort(v_values)
all_values = np.concatenate((u_values, v_values))  # (2)
all_values.sort(kind='mergesort')
deltas = np.diff(all_values)  # (3)
u_cdf_indices = u_values[u_sorter].searchsorted(all_values[:-1], 'right')  # (4)
v_cdf_indices = v_values[v_sorter].searchsorted(all_values[:-1], 'right')
v_cdf = v_cdf_indices / v_values.size  # (5)
u_cdf = u_cdf_indices / u_values.size
return np.sum(np.multiply(np.abs(u_cdf - v_cdf), deltas))  # (6)
What is the reasoning behind this implementation? Is there some literature?
I did look at the paper cited, which I believe explains why calculating the Wasserstein distance in its general definition in 1D is equivalent to evaluating the integral
\int_{-\infty}^{+\infty} |U(x) - V(x)| \, dx,
with U and V the cumulative distribution functions of the distributions u_values and v_values, but I don't understand how this integral is evaluated in the scipy implementation.
In particular,
a) why are they multiplying by the deltas in (6) to solve the integral?
b) how are v_cdf and u_cdf in (5) the cumulative distribution functions U and V?
Also, with this implementation the element order of the distribution u_values and v_values is not preserved. Shouldn't this be the case in the general Wasserstein distance definition?
Thank you for your help!
The order of the PDF, histogram or KDE is preserved, and it is important in the Wasserstein distance. If you only pass u_values and v_values, the function has to compute something like a PDF, KDE or histogram internally. Normally you would provide the PDF values and the support of U and V as the four arguments to wasserstein_distance; so in the case where only samples are provided, you are not passing a real distribution, simply a collection of repeated "experiments". Steps (1) and (4) in your list of code blocks effectively bin your data by its distinct values. A CDF is the number of discrete values up to a given point, i.e. P(X <= x), and it is the cumulative sum of a PDF, histogram or KDE. Step (5) normalizes the CDF to between 0.0 and 1.0, that is, it divides the cumulative counts by the number of samples.
So the order of the discrete values is preserved, not the original order within the sample.
b) It may make more sense if you plot the CDFs of a concrete sample, such as the pixel values of an image file, using the code above.
The transportation problem, however, may not need a PDF, but rather a datapoint of ordered features, or some way to measure distance between features, in which case you would calculate it differently.
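As a numerical sanity check (a minimal sketch, assuming two small random samples), steps (2)-(6) from the question reproduce what scipy.stats.wasserstein_distance returns:
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
u_values = rng.normal(0.0, 1.0, size=200)
v_values = rng.normal(0.5, 1.5, size=300)

# Evaluate both empirical CDFs on the pooled, sorted support and
# integrate |U - V| with the rectangle rule.
all_values = np.sort(np.concatenate((u_values, v_values)), kind='mergesort')
deltas = np.diff(all_values)
u_cdf = np.sort(u_values).searchsorted(all_values[:-1], 'right') / u_values.size
v_cdf = np.sort(v_values).searchsorted(all_values[:-1], 'right') / v_values.size
manual = np.sum(np.abs(u_cdf - v_cdf) * deltas)

print(manual, wasserstein_distance(u_values, v_values))  # the two should agree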

Python: How to find the n-th quantile of a 2-d distribution of points

I have a 2D-distribution of points (roughly speaking, two np.arrays, x and y) as shown in the figure attached.
How can I select the points of the distribution that are part of the n-th quantile of such distribution?
I finally came up with a solution, which doesn't look like the most elegant one possible, but it worked reasonably well:
To estimate quantiles of a 2-dimensional distribution, one can use the scipy function binned_statistic, which allows one to bin the data along one coordinate and compute a statistic of the other coordinate within each bin.
Here is the documentation for that function:
https://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.stats.binned_statistic.html
Its signature is:
scipy.stats.binned_statistic(x, values, statistic='mean', bins=10, range=None)
First, one might choose the number of bins to use, for example Nbins=100.
Next, one defines a user function to pass as the statistic (here is an example of how to do so: How to make user defined functions for binned_statistic), which in my case is a function that estimates the n-th percentile of the data in that bin (I called it myperc). Finally, a wrapper function is defined that takes x, y, Nbins and nth (the desired percentile) and returns both the binned_statistic result t, which has 3 outputs: statistic (the value of the desired statistic in each bin), bin_edges, and binnumber (which bin each data point falls in), and the x values at the centers of the bins (v):
import numpy as np
from scipy.stats import binned_statistic

def quantile2d(x, y, Nbins, nth):
    # Statistic: the nth percentile of the y values in each x bin.
    def myperc(vals, n=nth):
        return np.percentile(vals, n)

    t = binned_statistic(x, y, statistic=myperc, bins=Nbins)
    # Bin centers: midpoints of consecutive bin edges.
    v = 0.5 * (t.bin_edges[1:] + t.bin_edges[:-1])
    return t, v
So v and t.statistic give the x and y values, respectively, of a curve tracing the desired percentile.
Nbins = 100
nth = 30.
t, v = quantile2d(x, y, Nbins, nth)
ii = []
for i in range(Nbins):
    # binnumber is 1-based, so bin i+1 pairs with t.statistic[i]
    ii = ii + np.argwhere((t.binnumber == i + 1) & (y < t.statistic[i])).flatten().tolist()
ii = np.array(ii, dtype=int)
Finally, this gives the following plot:
plt.plot(x,y,'o',color='gray',ms=1,zorder=1)
plt.plot(v,t.statistic,'r-',zorder=3)
plt.plot(x[ii],y[ii],'o',color='blue',ms=1,zorder=2)
in which the line for the 30-th percentile is shown in red, and the data under this percentile is shown in blue.
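A minimal end-to-end sketch (with synthetic data standing in for x and y, which were not given in the question):
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0., 10., 5000)
y = x + rng.normal(0., 2., 5000)  # y trends with x, plus noise

t, v = quantile2d(x, y, 100, 30.)

plt.plot(x, y, 'o', color='gray', ms=1, zorder=1)  # all points
plt.plot(v, t.statistic, 'r-', zorder=3)           # 30th-percentile curve
plt.show()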

Converting discrete values into real values in python

I have a numpy array with discrete values. I used the numpy.digitize() function to get these discrete values from continuous values. Now I want to convert these discrete values back to the original continuous values. Is there a function in python which can help me do that? A sample code has been added below:
A = [437.479, 438.536, 440.026,............,471.161]
bins = numpy.linspace(numpy.amin(A),numpy.amax(A),255)
discretized_A = numpy.digitize(A, bins)
discretized_A = [1,8,18,................,237]
As you can see, I had a vector of real values; I used the digitize function to project that vector onto 255 equally spaced values spanning the min to the max of A, and got discretized_A as the end result. Now I want to reverse-engineer the steps and get my original real values back.
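Since digitize discards the position of each value within its bin, the original values cannot be recovered exactly; the closest you can get is the center of each value's bin. A minimal sketch of that approximate inverse, assuming the same bins array used for discretization is still available (the short A here is a toy stand-in for the real data):
import numpy

A = numpy.array([437.479, 438.536, 440.026, 471.161])
bins = numpy.linspace(numpy.amin(A), numpy.amax(A), 255)
discretized_A = numpy.digitize(A, bins)

# Bin centers; values in (1-based) bin i lie between bins[i-1] and bins[i].
centers = 0.5 * (bins[:-1] + bins[1:])
# Clip because digitize returns len(bins) for the exact maximum value.
approx_A = centers[numpy.clip(discretized_A - 1, 0, len(centers) - 1)]

print(approx_A)  # close to A, up to half a bin width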

Cumulative Distribution Function from arbitrary Probability Distribution Function

I'm trying to plot a Probability Distribution Function for a given set of data from a csv file
import numpy as np
import math
import matplotlib.pyplot as plt
data=np.loadtxt('data.csv',delimiter=',',skiprows=1)
x_value1= data[:,1]
x_value2= data[:,2]
weight1= data[:,3]
weight2= data[:,4]
where weight1 is an array of weights for the data in x_value1 and weight2 is the same for x_value2. I produce a histogram, passing the weights as a parameter:
plt.hist(x_value1, bins=40, color='r', density=True, weights=weight1, alpha=0.8, label='x_value1')  # density=True replaces the removed 'normed' argument
plt.hist(x_value2, bins=40, color='b', density=True, weights=weight2, alpha=0.6, label='x_value2')
My problem now is converting this PDF to a CDF. I read in one of the posts here that you can use numpy.cumsum() to convert a set of data to a CDF, so I tried it together with np.histogram():
values1,base1= np.histogram(x_value1, bins=40)
values2,base2= np.histogram(x_value2, bins=40)
cumulative1=np.cumsum(values1)
cumulative2=np.cumsum(values2)
plt.plot(base1[:-1],cumulative1,c='red',label='x_value1')
plt.plot(base2[:-1],cumulative2,c='blue',label='x_value2')
plt.title("CDF for x_value1 and x_value2")
plt.xlabel("x")
plt.ylabel("y")
plt.show()
I don't know if this plot is right because I didn't include the weights (weight1 and weight2) while doing the CDF. How can I include the weights while plotting the CDF?
If I understand your data correctly, you have a number of samples which have some weight associated with them. Maybe what you want is the experimental CDF of the sample.
The samples are in vector x and the weights in vector w. Let us first construct an Nx2 array of them:
arr = np.column_stack((x,w))
Then we will sort this array by the samples:
arr = arr[arr[:,0].argsort()]
This sorting may look a bit odd, but argsort gives the sorted order (0 for the smallest, 1 for the second smallest, etc.). When the two-column array is indexed by this result, the rows are arranged so that the first column is ascending. (Using only sort with axis=0 does not work, as it sorts both columns independently.)
Now we can create the cumulative fraction by taking the cumulative sum of weights:
cum = np.cumsum(arr[:,1])
This must be normalized so that the full scale is 1.
cum /= cum[-1]
Now we can plot the cumulative distribution:
plt.plot(arr[:,0], cum)
Now the X axis is the input value and the Y axis is the fraction of samples at or below each level.
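Putting the pieces together, a minimal self-contained sketch (with random samples and weights standing in for the CSV columns):
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=500)         # stand-in for x_value1
w = rng.uniform(0.1, 1.0, 500)   # stand-in for weight1

# Sort the samples and carry the weights along.
order = np.argsort(x)
x_sorted, w_sorted = x[order], w[order]

# Cumulative weight, normalized so the curve ends at 1.
cum = np.cumsum(w_sorted)
cum /= cum[-1]

plt.plot(x_sorted, cum, label='weighted ECDF')
plt.xlabel('x')
plt.ylabel('cumulative fraction')
plt.legend()
plt.show()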

Get original data array from probability density values and bins of numpy histogram

My purpose is to calculate the original data array from the information about probability density and bins returned by the np.histogram function.
For example:
import random
import numpy as np

a = random.sample(range(100), 50)
n, bin = np.histogram(a, bins=100, range=(-10, 10), density=True)  # 'normed' was removed from newer numpy
I would like to get a back from n and bin. I used np.digitize, but it doesn't seem to be the proper solution.
Actually my original purpose is to calculate the skewness and kurtosis of the original data from this histogram, so I have tried to convert n and bin back to the original data. If I could get skewness and kurtosis from the histogram directly, that would be perfect.
Thanks to user3823992, I tried scipy.stats.rv_discrete function to get skewness and kurtosis from bins and probability density function.
My edited code is:
a = random.sample(range(100), 50)
n, bin = np.histogram(a, bins=100, range=(-10, 10), density=True)
b2 = bin[:-1]
print(np.mean(a), np.var(a), sp.skew(a), sp.kurtosis(a))
dist = sp.rv_discrete(values=(b2, n))
print(dist.stats(moments='mvsk'))
However, the results from np.mean(a), np.var(a), sp.skew(a), sp.kurtosis(a) and from dist.stats(moments='mvsk') are very different. According to the documentation for scipy.stats.rv_discrete, the first array in values (here b2) should contain integer points, and the second (here n) should sum to 1.
The problem is that the numbers in my b2 are not integers, and the sum of n is not 1 either.
I multiplied n by the bin width and tried again, but it still didn't work.
Any idea or help would be appreciated.
Best regards,
Hoonill
scipy.stats.rv_discrete has you covered. It'll help you make a random distribution class from your data. The result will have a whole slew of handy methods. The .stats method will give you the first four moments. If you don't specify, it'll just return mean (m) and variance (v).
b2 = bin[:-1]
print(np.mean(a), np.var(a), scipy.stats.skew(a))
dist = scipy.stats.rv_discrete(values=(b2, n))
print(dist.stats(moments='mvsk'))
The above should be compatible with your code. Just reorganize to make use of the output.
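Alternatively, if the goal is only the moments, skewness and kurtosis can be computed directly from the histogram as weighted moments of the bin centers, avoiding rv_discrete's integer-support requirement altogether. A minimal sketch (assuming n and bin come from np.histogram with density=True, as above):
import numpy as np

def hist_moments(n, bin_edges):
    # Weighted moments over bin centers; weights are per-bin probabilities.
    centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    p = n * np.diff(bin_edges)  # density * width = probability per bin
    p = p / p.sum()             # renormalize to guard against rounding
    mean = np.sum(p * centers)
    var = np.sum(p * (centers - mean) ** 2)
    skew = np.sum(p * (centers - mean) ** 3) / var ** 1.5
    kurt = np.sum(p * (centers - mean) ** 4) / var ** 2 - 3  # excess kurtosis
    return mean, var, skew, kurt

print(hist_moments(n, bin))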
