How does numpy.histogram() work? - python

While reading up on numpy, I encountered the function numpy.histogram().
What is it for and how does it work? In the docs they mention bins: What are they?
Some googling led me to the definition of Histograms in general. I get that. But unfortunately I can't link this knowledge to the examples given in the docs.

A bin is a range that represents the width of a single bar of the histogram along the X-axis. You could also call this the interval. (Wikipedia defines them more formally as "disjoint categories".)
The Numpy histogram function doesn't draw the histogram, but it computes the occurrences of input data that fall within each bin, which in turn determines the area (not necessarily the height if the bins aren't of equal width) of each bar.
In this example:
np.histogram([1, 2, 1], bins=[0, 1, 2, 3])
There are 3 bins, for values ranging from 0 to 1 (excl. 1), 1 to 2 (excl. 2) and 2 to 3 (incl. 3), respectively. The way Numpy defines these bins is by giving a list of delimiters ([0, 1, 2, 3] in this example), although it also returns the bins in the results, since it can choose them automatically from the input if none are specified. If bins=5, for example, it will use 5 bins of equal width spread between the minimum input value and the maximum input value.
The input values are 1, 2 and 1. Therefore, bin "1 to 2" contains two occurrences (the two 1 values), and bin "2 to 3" contains one occurrence (the 2). These results are in the first item in the returned tuple: array([0, 2, 1]).
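A quick way to confirm the counts and edges described above is to unpack the returned tuple. The sketch below also tries bins=5 and a value sitting exactly on the last edge; the variable names are only for illustration.

```python
import numpy as np

# Explicit edges: the bins are [0, 1), [1, 2) and [2, 3]
counts, edges = np.histogram([1, 2, 1], bins=[0, 1, 2, 3])
print(counts)  # [0 2 1]
print(edges)   # [0 1 2 3]

# An integer asks for that many equal-width bins between min and max
counts5, edges5 = np.histogram([1, 2, 1], bins=5)
print(counts5)  # [2 0 0 0 1]

# A value equal to the last edge is counted in the last bin
last_counts, _ = np.histogram([3], bins=[0, 1, 2, 3])
print(last_counts)  # [0 0 1]
```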
Since the bins here are of equal width, you can use the number of occurrences for the height of each bar. When drawn, you would have:
a bar of height 0 for range/bin [0,1) on the X-axis,
a bar of height 2 for range/bin [1,2),
a bar of height 1 for range/bin [2,3].
You can plot this directly with Matplotlib (its hist function also returns the bins and the values):
>>> import matplotlib.pyplot as plt
>>> plt.hist([1, 2, 1], bins=[0, 1, 2, 3])
(array([0, 2, 1]), array([0, 1, 2, 3]), <a list of 3 Patch objects>)
>>> plt.show()
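When the bins are not of equal width, the point about area versus height starts to matter. As a rough sketch (the data and edges here are made up for illustration), dividing each count by the total number of samples and by the bin width gives bar heights whose areas sum to 1, which is what density=True computes:

```python
import numpy as np

data = [0.5, 1.5, 2.5, 3.5]           # illustrative values
edges = np.array([0.0, 1.0, 4.0])     # two bins, of width 1 and 3

counts, _ = np.histogram(data, bins=edges)
widths = np.diff(edges)

# Height of each bar so that its area (height * width) is the fraction of samples
heights = counts / (counts.sum() * widths)

density, _ = np.histogram(data, bins=edges, density=True)
print(counts)                          # [1 3]
print(np.allclose(heights, density))   # True
print((heights * widths).sum())        # 1.0
```

Here both bars end up with the same height even though the second bin holds three times as many values, because it is also three times as wide.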

import numpy as np
hist, bin_edges = np.histogram([1, 1, 2, 2, 2, 2, 3], bins = range(5))
Below, hist indicates that there are 0 items in bin #0, 2 in bin #1, 4 in bin #2, 1 in bin #3.
print(hist)
# [0 2 4 1]
bin_edges indicates that bin #0 is the interval [0,1), bin #1 is [1,2), ...,
bin #3 is [3,4] (the last bin also includes its right edge).
print(bin_edges)
# [0 1 2 3 4]
Play with the above code, change the input to np.histogram and see how it works.
But a picture is worth a thousand words:
import matplotlib.pyplot as plt
plt.bar(bin_edges[:-1], hist, width=1, align='edge')  # align='edge' so each bar spans its bin
plt.xlim(min(bin_edges), max(bin_edges))
plt.show()

Another useful thing to do with numpy.histogram is to plot the output as the x and y coordinates on a line graph. For example:
import numpy as np
import matplotlib.pyplot as plt
arr = np.random.randint(1, 51, 500)  # integers from 1 to 50
y, x = np.histogram(arr, bins=np.arange(52))  # edges 0..51, so 50 gets its own bin
fig, ax = plt.subplots()
ax.plot(x[:-1], y)
plt.show()
This can be a useful way to visualize histograms where you would like a higher level of granularity without bars everywhere. It is very useful in image histograms for identifying extreme pixel values.
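One detail worth noting: x[:-1] places each count at the left edge of its bin. If you would rather put each point in the middle of its bin, the centers can be computed from the edges. A small sketch, with a fixed seed assumed only to keep it repeatable:

```python
import numpy as np

rng = np.random.default_rng(0)         # seed chosen arbitrarily
arr = rng.integers(1, 51, size=500)    # integers from 1 to 50

# Edges 0..51 give one unit-wide bin per integer value
counts, edges = np.histogram(arr, bins=np.arange(52))

# Midpoint of every bin: the average of its left and right edge
centers = (edges[:-1] + edges[1:]) / 2

print(centers[:3])                   # [0.5 1.5 2.5]
print(len(centers) == len(counts))   # True
```

Passing centers instead of the left edges to ax.plot shifts the line by half a bin, which matters more as the bins get wider.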

Related

Excluding rightmost edge in numpy.histogram

I have a list of numbers a and a list of bins which I shall use to bin the numbers in a using numpy.histogram. The bins are calculated from the mean and standard deviation (std) of a. So the number of bins is B, the minimum value of the first bin is mean - std, and the maximum of the last bin is mean + std.
An example goes like the following:
>>> a
array([1, 1, 3, 2, 2, 6])
>>> bins = np.linspace(mean - std, mean + std, B + 1)
>>> bins
array([ 0.79217487, 1.93072496, 3.06927504, 4.20782513])
>>> np.histogram(a, bins=bins)[0]
array([2, 3, 0], dtype=int32)
However, I want to exclude the rightmost edge of the last bin, i.e. if some value in a exactly equals mean + std, I do not wish to include it in the last bin. The detail about mean and std is not important; excluding the rightmost edge (i.e. making the last bin a half-open interval) is. Unfortunately, the doc says in this regard:
All but the last (righthand-most) bin is half-open. In other words, if
bins is:
[1, 2, 3, 4] then the first bin is [1, 2) (including 1, but excluding
2) and the second [2, 3). The last bin, however, is [3, 4], which
includes 4.
Is there a simple solution I can employ? That is, one that does not involve manually fixing edges. That is something I can do, but that's not what I'm looking for. Is there a flag I can pass or a different method I can use?
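Here's one (kind of crude?) way to make the last bin half-open instead of closed. What I'm doing is moving the right side of the right-most bin down to the next representable float below it, so a value exactly equal to the original edge no longer falls inside the bin:
a = np.array([1, 1, 3, 2, 2, 6])
B = 3 # (in this example)
bins = np.linspace(a.mean() - a.std(), a.mean() + a.std(), B + 1)
# array([ 0.79217487, 1.93072496, 3.06927504, 4.20782513])
bins[-1] = np.nextafter(bins[-1], -np.inf) # <== this is the crucial line
np.histogram(a, bins=bins)
Simply subtracting np.finfo(float).eps is not reliable here: eps is the spacing between floats at 1.0, and for a larger edge such as 4.2 the subtraction rounds back to the same value, so the bin stays closed. np.nextafter steps to the adjacent representable value whatever the magnitude of the edge.
If you're using some type other than float for the values in a, what matters is the dtype of bins (np.linspace returns float64 by default), since that is where the edge is stored.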
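To see whether an adjustment to the right edge actually takes effect, it helps to test with a value that sits exactly on it. The sketch below uses made-up edges [1, 3, 5] and np.nextafter, which returns the adjacent representable float in a given direction; it is offered as one possible check, not the only way to do this.

```python
import numpy as np

data = np.array([1.0, 3.0, 5.0])
edges = np.array([1.0, 3.0, 5.0])

# Default behaviour: the last bin [3, 5] is closed, so 5.0 is counted
closed, _ = np.histogram(data, bins=edges)
print(closed)  # [1 2]

# Subtracting machine epsilon does not move an edge as large as 5.0
eps_is_noop = (edges[-1] - np.finfo(float).eps) == edges[-1]
print(eps_is_noop)  # True

# Step down to the next representable float instead
half_open_edges = edges.copy()
half_open_edges[-1] = np.nextafter(edges[-1], -np.inf)
half_open, _ = np.histogram(data, bins=half_open_edges)
print(half_open)  # [1 1]
```

Values above the last edge are simply ignored by np.histogram, which is why 5.0 drops out of the second result.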
Filter the array first, but do NOT use the numpy.clip() function for this. It would just set out-of-bounds data to the low/high clip values, where they would be counted in the leftmost and rightmost bins, creating high peaks at both ends.
The following code worked for me. My case is an integer array; I guess it should be OK with a float array too.
clip_low = a.mean() - a.std() # I converted these to int in my case
clip_high = a.mean() + a.std() # should be OK as float
clip = a[(clip_low <= a) & (a < clip_high)] # drops values equal to clip_high (do NOT use the np.clip() function)
bins = int(clip_high - clip_low) # or use your own number of bins
hist, bins_edge = np.histogram(clip, bins=bins, range=(clip_low, clip_high))
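The difference between np.clip and a boolean mask is easy to see on a tiny array. This sketch uses illustrative bounds of 1 and 3 rather than mean ± std:

```python
import numpy as np

a = np.array([1, 1, 3, 2, 2, 6])
low, high = 1, 3   # illustrative bounds, not mean +/- std

# np.clip keeps every element and moves outliers onto the bounds
clipped = np.clip(a, low, high)
print(clipped)  # [1 1 3 2 2 3]

# A boolean mask drops everything outside [low, high)
masked = a[(a >= low) & (a < high)]
print(masked)   # [1 1 2 2]

hist_clipped, _ = np.histogram(clipped, bins=2, range=(low, high))
hist_masked, _ = np.histogram(masked, bins=2, range=(low, high))
print(hist_clipped)  # [2 4]
print(hist_masked)   # [2 2]
```

With np.clip the 6 is moved to 3 and inflates the last bin; with the mask both the 6 and the value equal to the upper bound are left out.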

Sum up data on specific (multiple) ranges

I'm certain there's a good way to do this but I'm blanking on the right search terms to google, so I'll ask here instead. My problem is this:
I have two 2-dimensional arrays, both with the same dimensions. One array (array 1) is the accumulated precipitation at (x,y) points. The other (array 2) is the topographic height of the same (x,y) grid. I want to sum up array 1 between specific heights of array 2, and create a bar graph with topographic height bins as the x-axis and total accumulated precipitation on the y-axis.
So I want to be able to declare a list of heights (say [0, 100, 200, ..., 1000]) and for each bin, sum up all precipitation that occurred within that bin.
I can think of a few complicated ways to do this, but I'm guessing there's probably an easier way that I'm not thinking of. My gut instinct is to loop through my list of heights, mask anything outside of that range, sum up remaining values, add those to a new array, and repeat.
I'm wondering if there's a built-in function in numpy or a similar library that can do this more efficiently.
This code shows what you're asking for, some explanation in comments:
import numpy as np
def in_range(x, lower_bound, upper_bound):
    # returns whether x is between lower_bound (inclusive) and upper_bound (exclusive)
    # note: range() only matches integer values
    return x in range(lower_bound, upper_bound)
# vectorize allows you to easily 'map' the function to a numpy array
vin_range = np.vectorize(in_range)
# representing your rainfall
rainfall = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# representing your height map
height = np.array([[1, 2, 1], [2, 4, 2], [3, 6, 3]])
# the bands of height you're looking to sum
bands = [[0, 2], [2, 4], [4, 6], [6, 8]]
# computing the actual results you'd want to chart
result = [(band, sum(rainfall[vin_range(height, *band)])) for band in bands]
print(result)
The next to last line is where the magic happens. vin_range(height, *band) uses the vectorized function to create a numpy array of boolean values, with the same dimensions as height, that has True if a value of height is in the range given, or False otherwise.
By using that array to index the array with the target values (rainfall), you get an array that only has the values for which the height is in the target range. Then it's just a matter of summing those.
In more steps than result = [(band, sum(rainfall[vin_range(height, *band)])) for band in bands] (but with the same result):
result = []
for lower, upper in bands:
    include = vin_range(height, lower, upper)
    values_to_include = rainfall[include]
    sum_of_rainfall = sum(values_to_include)
    result.append(([lower, upper], sum_of_rainfall))
You can use np.bincount together with np.digitize. digitize creates an array of bin indices from the height array height and the bin boundaries bins. bincount then uses the bin indices to sum the data in array rain.
# set up
rain = np.random.randint(0,100,(5,5))/10
height = np.random.randint(0,10000,(5,5))/10
bins = [0,250,500,750,10000]
# compute
sums = np.bincount(np.digitize(height.ravel(),bins),rain.ravel(),len(bins)+1)
# result
sums
# array([ 0. , 37. , 35.6, 14.6, 22.4, 0. ])
# check against direct method
[rain[(height>=bins[i]) & (height<bins[i+1])].sum() for i in range(len(bins)-1)]
# [37.0, 35.6, 14.600000000000001, 22.4]
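For a repeatable check, the same computation can be run on a small fixed grid (values made up for illustration). The sketch also shows np.histogram with its weights parameter, which sums the weights per bin instead of counting and so gives the per-bin totals directly:

```python
import numpy as np

# Small illustrative grids with matching shapes
rain = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
height = np.array([[1, 2, 1], [2, 4, 2], [3, 6, 3]])
bins = [0, 2, 4, 6, 8]

# digitize + bincount: index 0 collects values below bins[0],
# the last index collects values at or above bins[-1]
idx = np.digitize(height.ravel(), bins)
sums = np.bincount(idx, weights=rain.ravel(), minlength=len(bins) + 1)
print(sums[1:-1])

# np.histogram flattens its input and adds up the weights that fall in each bin
weighted, _ = np.histogram(height, bins=bins, weights=rain)
print(weighted)
```

Both give totals of 4, 28, 5 and 8 for the four bands. One difference to keep in mind: np.histogram counts a height exactly equal to the last edge in the last bin, while digitize puts it in the overflow slot.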
Here is an example using the numpy ma module, which lets you create masked arrays. From the docs:
A masked array is the combination of a standard numpy.ndarray and a mask. A mask is either nomask, indicating that no value of the associated array is invalid, or an array of booleans that determines for each element of the associated array whether the value is valid or not.
which seems to be what you need in this case.
import numpy as np
pr = np.random.randint(0, 1000, size=(100, 100)) #precipitation map
he = np.random.randint(0, 1000, size=(100, 100)) #height map
bins = np.arange(0, 1001, 200)
values = []
for vmin, vmax in zip(bins[:-1], bins[1:]):
    # creating the masked array, here minimum included inside bin, maximum excluded.
    maskedpr = np.ma.masked_where((he < vmin) | (he >= vmax), pr)
    values.append(maskedpr.sum())
values is the list of values for each bin, which you can plot.
The numpy.ma.masked_where function returns an array masked where condition is True. So you need to set the condition to be True outside the bins.
The sum() method performs the sum only where the array is not masked.

Is it possible to update / add to a numpy histogram (specifically, numpy.histogram2d) that is already populated?

I have already populated a numpy.histogram2d with a pair of lists (x0,y0). Can I now augment the histogram with an additional pair of two lists (x1,y1) so that the histogram contains both (x0,y0) and (x1,y1)?
The relevant and official documentation is here:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram2d.html
On this page I only see parameters and returns, but not functions that this object supports. How can I find all the supported functions?
np.histogram2d is not an object, as pointed out in the comments. It is a function that returns an array with the bin values as well as two arrays with the bin edges. Nonetheless, as long as you do not compute a normed histogram, you can simply add to the histogram using the same bins. For example, to extend the example from the np.histogram2d documentation:
import numpy as np
x = np.random.normal(3, 1, 100)
y = np.random.normal(1, 1, 100)
xedges = [0, 1, 1.5, 3, 5]
yedges = [0, 2, 3, 4, 6]
H, xedges, yedges = np.histogram2d(x, y, bins=(xedges, yedges))
x2 = np.random.normal(3, 1, 100)
y2 = np.random.normal(1, 1, 100)
H += np.histogram2d(x2, y2, bins=(xedges, yedges))[0]
This will give you the combined bin values in H, with bin edges xedges and yedges.
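To convince yourself that adding the counts is equivalent to histogramming everything at once, you can compare the two results. A minimal sketch, with an arbitrary seed so it is repeatable:

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed for repeatability
xedges = [0, 1, 1.5, 3, 5]
yedges = [0, 2, 3, 4, 6]

x0, y0 = rng.normal(3, 1, 100), rng.normal(1, 1, 100)
x1, y1 = rng.normal(3, 1, 100), rng.normal(1, 1, 100)

# Accumulate batch by batch, always with the same edges
H = np.histogram2d(x0, y0, bins=(xedges, yedges))[0]
H += np.histogram2d(x1, y1, bins=(xedges, yedges))[0]

# Histogram of all the data at once, for comparison
H_all = np.histogram2d(np.concatenate([x0, x1]),
                       np.concatenate([y0, y1]),
                       bins=(xedges, yedges))[0]

print(np.array_equal(H, H_all))  # True
```

This only holds for raw counts; normalized histograms cannot be added this way because each one is scaled by its own sample size.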
Just as a bit of self-promotion, if you want to use updateable histograms and object-like histogram behaviour in general, you can try a library that I wrote, https://github.com/janpipek/physt.

Return row/column values from array underneath lines

What I'm trying to do is generate multiple lines on a binary image based on a length and angle. Then return all of the row/column values along with pixel values underneath those lines and place them into a python list.
To generate those lines I wrote a function that outputs start and end coordinates of each line. Using these coordinates I want to generate the lines and extract the values.
To extract values from a horizontal line from pixel (0,1) to (2,1) I can do:
a = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
pixels = a[1, 0:3]
or vertical:
pixels = a[0:3, 1]
which return arrays of all the pixel values underneath those lines:
array([3, 4, 5])
array([1, 4, 7])
How could I apply this method to lines at an angle, i.e. given x1,y1 and x2,y2? These return (syntax) errors:
a([0,0], [2,2])
a([0,0]:[2,2])
a[0,0:2,2]
I'm looking for something similar to 'improfile' in Matlab.
Many thanks for your help!
You can use scikit-image's draw module and its draw.line function:
>>> from skimage.draw import line
>>> y0 = 1; x0 = 1; y1 = 10; x1 = 10;
>>> rr, cc = line(y0, x0, y1, x1)
rr and cc will contain row and column indexes of the values from which the line passes. You can access those values as:
>>> values = img[rr, cc]
Assuming img is the name of your image.
Note that this implementation does not offer interpolation or subpixel accuracy for angles different from 45 degree intervals. It will create a discrete stepped line between points A and B that passes through whole pixels.
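If scikit-image is not available, a rough equivalent can be sketched with numpy alone by sampling evenly along the segment and rounding to the nearest pixel. The helper name line_indices is hypothetical, and this is not the algorithm skimage uses, so the two may pick different pixels on some slopes:

```python
import numpy as np

def line_indices(r0, c0, r1, c1):
    """Hypothetical helper: nearest-pixel indices on the segment (r0, c0)-(r1, c1)."""
    # One sample per step along the longer axis, endpoints included
    n = max(abs(r1 - r0), abs(c1 - c0)) + 1
    rr = np.rint(np.linspace(r0, r1, n)).astype(int)
    cc = np.rint(np.linspace(c0, c1, n)).astype(int)
    return rr, cc

a = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])

rr, cc = line_indices(0, 0, 2, 2)   # main diagonal
diagonal = a[rr, cc]
print(diagonal)                     # [0 4 8]

rr, cc = line_indices(1, 0, 1, 2)   # horizontal line along row 1
row = a[rr, cc]
print(row)                          # [3 4 5]
```

As with draw.line, indexing with the two integer arrays returns the pixel values in order from the start point to the end point.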

Python: NxM array of samples drawn from NxM normal distributions

I have two 2D arrays (or of higher dimension), one that defines averages (M) and one that defines standard deviations (S). Is there a python library (numpy, scipy, ...?) that allows me to generate an array (X) containing samples drawn from the corresponding distributions?
In other words: each entry xij is a sample that comes from the normal distribution defined by the corresponding mean mij and standard deviation sij.
Yes, numpy can help here:
There is a np.random.normal function that accepts array-like inputs:
import numpy as np
means = np.arange(10) # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
stddevs = np.ones(10) # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
samples = np.random.normal(means, stddevs)
array([-1.69515214, -0.20680708, 0.61345775, 2.98154162, 2.77888087,
7.22203785, 5.29995343, 8.52766436, 9.70005434, 9.58381479])
even if they are multidimensional:
means = np.arange(10).reshape(2,5) # make it multidimensional with shape 2, 5
stddevs = np.ones(10).reshape(2,5)
samples = np.random.normal(means, stddevs)
array([[-0.76585438, 1.22226145, 2.85554809, 2.64009423, 4.67255324],
[ 3.21658151, 4.59969355, 6.87946817, 9.14658687, 8.68465692]])
The second one has a shape of (2,5)
In case you want only different means but the same standard deviation you can also only pass one array and one scalar and still get an array with the right shape:
means = np.arange(10)
samples = np.random.normal(means, 1)
array([ 0.54018686, -0.35737881, 2.08881115, 3.08742942, 4.4426366 ,
3.6694955 , 5.27515536, 8.68300816, 8.83893819, 7.71284217])
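The scalar case above is one instance of broadcasting: the means and standard deviations only need to be broadcastable to a common shape. A small sketch using the newer Generator interface (np.random.default_rng), with M and S as illustrative names and an arbitrary seed:

```python
import numpy as np

rng = np.random.default_rng(0)      # arbitrary seed

M = np.array([[0.0], [100.0]])      # shape (2, 1): one mean per row
S = np.array([[0.1, 1.0, 2.0]])     # shape (1, 3): one std per column

X = rng.normal(loc=M, scale=S)      # broadcast to shape (2, 3)
print(X.shape)  # (2, 3)

# With scale 0 every sample equals its mean, which makes the pairing visible
exact = rng.normal(loc=M, scale=0.0)
print(np.array_equal(exact, M))  # True
```

Each entry X[i, j] is drawn with mean M[i, 0] and standard deviation S[0, j], which is the element-wise pairing asked for in the question once both arrays have the full N x M shape.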
