When using numpy.histogram with density=True, the function returns an array of PDF values. My question is: does it return the PDF value at the leading edge of each bin or at the middle of each bin?
For example, if I have bins 0-1, 1-2, 2-3, etc., will it give me the PDF at the points 0, 1, 2, etc., or at 0.5, 1.5, 2.5, etc.?
Thank you!
Each normalized histogram value will give you the estimated probability density for your sample over the range spanned by its corresponding bin edges. If you had bin edges a and b then the corresponding normalized histogram value would be the probability density over the interval [a,b).
Intuitively, to estimate density from some finite number of samples you count the number of samples that fall into each histogram bin, then divide by the area of the bin. For infinitely many samples and infinitely small bins this would eventually converge on the PDF of the underlying continuous distribution.
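For example, a quick check (just a sketch, not from the original post) shows that numpy returns one density value per bin, valid across the whole bin rather than at any single point:

import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(size=10000)

density, edges = np.histogram(sample, bins=[0, 1, 2, 3], density=True)
# density[i] is the estimated density over [edges[i], edges[i+1])
# (the last bin also includes its right edge)
print(len(edges), len(density))            # 4 edges, 3 values: one density per bin
print(np.sum(density * np.diff(edges)))    # 1.0: the densities integrate to one over the bins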
I have a 50 by 50 grid of evenly spaced (x, y) points. Each of these points has a third scalar value. This can be visualized using a contour plot, which I have added. I am interested in the regions indicated by the red circles. These regions of low Z values are what I want to extract from this data.
2D contour plot of 50 x 50 evenly spaced grid points:
I want to do this by using clustering (machine learning), which can be lightning quick when applied correctly. The problem, however, is that the points are evenly spaced, so the spatial density of the dataset is the same everywhere.
I have tried using a DBSCAN algorithm with a custom distance metric which takes into account the Z values of each point. I have defined the distance between two points as follows:
import numpy as np

def custom_distance(point1, point2):
    # average of the two points' Z values
    average_Z = (point1[2] + point2[2]) / 2
    # Euclidean distance in the x-y plane, weighted by the average Z value
    distance = np.sqrt(np.square(point1[0] - point2[0]) + np.square(point1[1] - point2[1]))
    distance = distance * average_Z
    return distance
This essentially computes the Euclidean distance between the two points in the x-y plane and multiplies it by the average of their Z values. In the picture below I have tested this distance function in a DBSCAN algorithm. Each point in a 50 by 50 grid has a Z value of 1, except for four clusters that I have placed at random positions; those points each have a Z value of 10. The algorithm is able to find the clusters in the data based on their Z value, as can be seen below.
DBSCAN clustering result using scalar value distance determination:
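For reference, wiring this metric into scikit-learn's DBSCAN for such a test might look roughly like the sketch below; the grid construction, cluster placement, and the eps/min_samples values are illustrative guesses, not the original settings:

import numpy as np
from sklearn.cluster import DBSCAN

# build the 50 x 50 grid; every point gets Z = 1 except a couple of illustrative high-Z clusters
xx, yy = np.meshgrid(np.arange(50), np.arange(50))
zz = np.ones((50, 50))
zz[5:8, 5:8] = 10.0            # hypothetical cluster placement
zz[30:33, 40:43] = 10.0

X = np.column_stack([xx.ravel(), yy.ravel(), zz.ravel()])   # columns: x, y, Z
labels = DBSCAN(eps=2.0, min_samples=4, metric=custom_distance).fit_predict(X)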
Encouraged by these results, I tried to apply the approach to my actual data, only to be disappointed. Since the x and y values of my data are very large, I have simply scaled them to run from 0 to 49. The Z values I have left untouched. The results of the clustering can be seen in the image below:
Clustering result on original data:
This does not come close to what I want and what I was expecting. For some reason the clusters that are found are rectangular in shape, and the light regions of low Z values that I am interested in are not extracted with this approach.
Is there any way I can make the DBSCAN algorithm work in this way? I suspect that the reason it is currently not working has something to do with the differences in scale between the x, y and Z values. I am also open to tips or recommendations on other approaches to defining and finding the lighter regions in the data.
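One hedged idea along the lines of that suspicion, continuing the hypothetical X = [x, y, Z] array from the sketch above: rescale the real Z values to the range used in the synthetic test (1 to 10), so that Z and the x-y spacing are on comparable scales. The target range and the DBSCAN parameters here are assumptions.

# Illustrative only: rescale Z into [1, 10] before reusing the custom metric
z = X[:, 2]
X_scaled = X.copy()
X_scaled[:, 2] = 1.0 + 9.0 * (z - z.min()) / (z.max() - z.min())
labels = DBSCAN(eps=2.0, min_samples=4, metric=custom_distance).fit_predict(X_scaled)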
I would like to associate a probability value to a number.
Let's say I consider a normal probability distribution with mean = 7 and std = 3.
I can generate a random number from this distribution like this:
np.random.normal(7, 3, 1)
I would like to find a method that gives me the probability associated with a given number.
For instance, what is the probability associated with the value 0.6 under this distribution?
Let's assume I generate the histogram of n random values.
x = np.random.normal(7, 3, 100000)
plt.hist(x, 10)
Here I can see that a value of 5 has a probability of ~0.11, while a value of 20 has a probability of ~0.
For any normalized continuous distribution represented on a histogram like the one above, the only way to find the probability of a given histogram bin is to take the integral of that distribution over the range of that bin. So this depends on:
The distribution
The range of the bin you are considering
You can use the scipy package for example to do this integral numerically for you.
https://docs.scipy.org/doc/scipy/reference/tutorial/integrate.html
If you need something simpler, you can approximate this probability by taking the value of the PDF at the center of the bin and multiplying by the width of the bin.
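As a hedged sketch of both routes for the N(7, 3) example above (the bin [4.5, 5.5) is just an illustrative choice):

from scipy.stats import norm

dist = norm(loc=7, scale=3)        # the N(7, 3) distribution from the question

# exact probability of landing in the bin [4.5, 5.5): integral of the PDF over the bin
p_exact = dist.cdf(5.5) - dist.cdf(4.5)

# midpoint approximation: PDF at the bin centre times the bin width
p_approx = dist.pdf(5.0) * 1.0

# note: the probability of any single exact value (e.g. 0.6) is zero for a
# continuous distribution; only intervals have nonzero probability
print(p_exact, p_approx)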
Following on from the post found here: 2D histogram coloured by standard deviation in each bin
I would like to colour each bin in a 2D grid by the fraction of points whose label values are below a certain threshold in Python.
Note that, in this dataset, each point has a continuous label value between 0-1.
For example here is a histogram I made whereby the colour denotes the standard deviation of label values of all points in each bin:
The way this was done was by using scipy.stats.binned_statistic_2d() (see: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binned_statistic_2d.html) and setting the statistic argument to 'std'.
But is there a way to change this kind of plot so that the colouring is representative of the fraction of points in each bin with label value below 0.5 for example?
It could be that the only way to do this is by explicitly defining a grid of some kind and calculating the fractions, but I'm not sure of the best way to do that, so any help on this matter would be greatly appreciated!
Maybe using scipy.stats.binned_statistic_2d or numpy.histogram2d and being able to return the raw data values in each bin as a multi-dimensional array would help in quickly computing the fractions explicitly.
The fraction of elements in an array below a threshold can be calculated as
fraction = lambda a, threshold: len(a[a<threshold])/len(a)
Hence you can call
scipy.stats.binned_statistic_2d(x, y, values, statistic=lambda a: fraction(a, 0.5))
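Putting it together, a self-contained sketch (the random data, bin count, and the pcolormesh plotting are illustrative; the guard for empty bins returns NaN):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binned_statistic_2d

rng = np.random.default_rng(0)
x, y = rng.uniform(size=(2, 5000))
values = rng.uniform(size=5000)            # continuous labels between 0 and 1

# fraction of labels below 0.5 in each bin; empty bins become NaN
fraction_below = lambda a: np.mean(a < 0.5) if a.size else np.nan
stat, xedges, yedges, _ = binned_statistic_2d(x, y, values,
                                              statistic=fraction_below, bins=20)

plt.pcolormesh(xedges, yedges, stat.T)     # transpose: stat is indexed [x_bin, y_bin]
plt.colorbar(label='fraction of labels < 0.5')
plt.show()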
Given some list of numbers following some arbitrary distribution, how can I define bin positions for matplotlib.pyplot.hist() so that the area in each bin is equal to (or close to) some constant area, A? The area should be calculated by multiplying the number of items in the bin by the width of the bin and its value should be no greater than A.
Here is a MWE to display a histogram with normally distributed sample data:
import matplotlib.pyplot as plt
import numpy as np
x = np.random.randn(100)
plt.hist(x, bin_pos)
plt.show()
Here bin_pos is a list representing the positions of the boundaries of the bins (see the related question here).
I found this question intriguing. The solution depends on whether you want to plot a density function, or a true histogram. The latter case turns out to be quite a bit more challenging. Here is more info on the difference between a histogram and a density function.
Density Functions
This will do what you want for a density function:
def histedges_equalN(x, nbin):
    # bin edges that put (approximately) the same number of points in each bin
    npt = len(x)
    return np.interp(np.linspace(0, npt, nbin + 1),
                     np.arange(npt),
                     np.sort(x))

x = np.random.randn(1000)
n, bins, patches = plt.hist(x, histedges_equalN(x, 10), density=True)
Note the use of density=True (this was normed=True in older versions of matplotlib), which specifies that we're calculating and plotting a density function. In this case the areas are identically equal (you can check by looking at n * np.diff(bins)). Also note that this solution involves finding bins that contain the same number of points.
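For instance, that check could look like this (just a quick sanity check, assuming the n and bins returned by the call above):

areas = n * np.diff(bins)   # density value times bin width, per bin
print(areas)                # all equal, and summing to 1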
Histograms
Here is a solution that gives approximately equal area boxes for a histogram:
def histedges_equalA(x, nbin):
    # choose bin edges so that count * width is roughly constant across bins
    pow = 0.5
    dx = np.diff(np.sort(x))
    tmp = np.cumsum(dx ** pow)
    tmp = np.pad(tmp, (1, 0), 'constant')
    return np.interp(np.linspace(0, tmp.max(), nbin + 1),
                     tmp,
                     np.sort(x))

nbin = 10    # number of bins (example value)
n, bins, patches = plt.hist(x, histedges_equalA(x, nbin))
These boxes, however, are not all of equal area. The first and last, in particular, tend to be about 30% larger than the others. This is an artifact of the sparse distribution of the data at the tails of the normal distribution, and I believe it will persist any time there is a sparsely populated region in a data set.
Side note: I played with the value pow a bit, and found that a value of about 0.56 had a lower RMS error for the normal distribution. I stuck with the square-root because it performs best when the data is tightly-spaced (relative to the bin-width), and I'm pretty sure there is a theoretical basis for it that I haven't bothered to dig into (anyone?).
The issue with equal-area histograms
As far as I can tell it is not possible to obtain an exact solution to this problem. This is because it is sensitive to the discretization of the data. For example, suppose the first point in your dataset is an outlier at -13 and the next value is at -3, as depicted by the red dots in this image:
Now suppose the total "area" of your histogram is 150 and you want 10 bins. In that case the area of each histogram bar should be about 15, but you can't get there because as soon as your bar includes the second point, its area jumps from 10 to 20. That is, the data does not allow this bar to have an area between 10 and 20. One solution for this might be to adjust the lower-bound of the box to increase its area, but this starts to become arbitrary and does not work if this 'gap' is in the middle of the data set.
I have a grid of model spectra, which have a constant, very high spectral resolution, and I need to down-sample them to a lower resolution, while preserving the total number of counts.
In essence, if the first 5 bins have (nominal center-of-bin) wavelengths [7.8, 7.81, 7.82, 7.83, 7.84], and the values [1.01, 1.02, 1.015, 1.014, 1.02], and my desired bins are some (non-integer) factor (say, 2.5) times as wide, I want my new spectrum to have nominal wavelengths [7.81, 7.83] and values [1.01+1.02+0.5*1.015, 0.5*1.015+1.014+1.02] (in general, though, the bins are not lined up as well, so you may get fractions of bins on either side).
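To make that bookkeeping concrete, here is a hedged sketch (not from the original post) that reproduces the two example values by interpolating a cumulative sum at the new bin edges; it assumes the flux is constant within each old bin, and the edge positions are inferred from the stated centres and 0.01 spacing:

import numpy as np

def rebin_conserve(values, old_edges, new_edges):
    # cumulative counts at the old bin edges: 0 at the left edge, the total at the right
    cum = np.concatenate(([0.0], np.cumsum(values)))
    # interpolate the cumulative counts at the new edges, then take differences
    return np.diff(np.interp(new_edges, old_edges, cum))

values = np.array([1.01, 1.02, 1.015, 1.014, 1.02])
old_edges = 7.795 + 0.01 * np.arange(6)        # edges of the five 0.01-wide bins
new_edges = np.array([7.795, 7.820, 7.845])    # two bins, each 2.5x as wide
print(rebin_conserve(values, old_edges, new_edges))
# -> [2.5375, 2.5415], i.e. [1.01 + 1.02 + 0.5*1.015, 0.5*1.015 + 1.014 + 1.02]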
I'll call my grid spec_ssp, and it has a shape of (93, 16, 39848). Wavelength varies along axis 2, and the first two axes are other parameters. I also have the nominal (central) wavelengths for each wavelength bin (technically, they're the log of the wavelength, but that shouldn't matter), called logL_ssp, and the desired new spacing of the logL grid, dlogL_new. I can figure out the nominal logL spacing of my templates, dlogL_ssp, by calculating np.median(logL_ssp[1:] - logL_ssp[:-1]), and it's about 20% of the desired logL spacing. We'll call that fraction f.
I originally tried to use scipy.ndimage.zoom, using the aforementioned factor f, but discovered that it gives me an array that's downsampled by a factor of exactly 4. I need an exact resampling, so this won't work.
Next, I tried linearly interpolating along axis 2 using scipy.interpolate.interp1d, after setting up new bin limits, with the aim of integrating the spectra in my grid using scipy.integrate.quad between successive bin limits, effectively getting a more or less rigorous estimate of the total light in each of my new bins. However, quad doesn't play nicely with interp1d's interpolators (quad doesn't like non-float inputs). And since I have ~1500 model spectra, the whole thing takes ages to run while iterating over all three axes (yes, I'm only building a new interpolator once per model spectrum).
Any ideas how to tackle this?