I have a list of numbers a and a list of bins which I will use to bin the numbers in a using numpy.histogram. The bins are calculated from the mean and standard deviation (std) of a: the number of bins is B, the minimum value of the first bin is mean - std, and the maximum of the last bin is mean + std.
An example goes like the following:
>>> a
array([1, 1, 3, 2, 2, 6])
>>> bins = np.linspace(mean - std, mean + std, B + 1)
>>> bins
array([ 0.79217487,  1.93072496,  3.06927504,  4.20782513])
>>> numpy.histogram(a, bins=bins)[0]
array([2, 3, 0], dtype=int32)
However, I want to exclude the rightmost edge of the last bin - i.e. if some value in a exactly equals mean + std, I do not wish to include it in the last bin. The detail about mean and std is not important; excluding the rightmost edge (i.e. making it a half-open interval) is. Unfortunately, the docs say in this regard:
All but the last (righthand-most) bin is half-open. In other words, if
bins is:
[1, 2, 3, 4] then the first bin is [1, 2) (including 1, but excluding
2) and the second [2, 3). The last bin, however, is [3, 4], which
includes 4.
Is there a simple solution I can employ? That is, one that does not involve manually fixing edges. That is something I can do, but that's not what I'm looking for. Is there a flag I can pass or a different method I can use?
Here's one (kind of crude?) way to make the last bin half-open instead of closed. What I'm doing is subtracting the smallest possible value from the right edge of the right-most bin:
a = np.array([1, 1, 3, 2, 2, 6])
B = 3 # (in this example)
bins = np.linspace(a.mean() - a.std(), a.mean() + a.std(), B + 1)
# array([ 0.79217487,  1.93072496,  3.06927504,  4.20782513])
bins[-1] -= np.finfo(float).eps  # <== this is the crucial line
np.histogram(a, bins=bins)
If the values in a are of some type other than float, use that type in the call to finfo. For example:
np.finfo(float).eps
np.finfo(np.float128).eps
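One caveat worth noting: np.finfo(float).eps is the spacing of floats near 1.0, so for an edge whose magnitude is larger than 2, subtracting eps can round back to the exact same value. A more robust alternative (my suggestion, not part of the original answer) is np.nextafter, which steps to the adjacent representable float at any scale:

bins = np.linspace(a.mean() - a.std(), a.mean() + a.std(), B + 1)
bins[-1] = np.nextafter(bins[-1], -np.inf)  # largest float strictly below the old edge
np.histogram(a, bins=bins)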
Filter the array first. Do NOT use the numpy.clip() function: it would just set out-of-bounds data to the clip low/high values, which would then be counted into the leftmost and rightmost bins and show up as high peaks at both ends.
The following code worked for me. My case is an integer array; I expect it should also be fine with a float array.
clip_low = a.mean() - a.std()
clip_high = a.mean() + a.std()
clipped = a[(clip_low <= a) & (a < clip_high)]  # strictly below clip_high; do NOT use np.clip()
bins = int(clip_high - clip_low)  # or substitute your own number of bins
hist, bin_edges = np.histogram(clipped, bins=bins, range=(clip_low, clip_high))
I have two 3D arrays mean and std containing, as their names state, mean values and standard deviations. Both arrays have the same shape, so the mean value and standard deviation at each position correspond. For each position of the arrays, I would like to use the value in mean and the corresponding value in std to define a truncated normal distribution, from which I draw a single value that I store at the corresponding position in another array p of the same shape as mean and std.
Of course, I thought of using scipy.stats.truncnorm but I encounter broadcasting problems and I am a bit lost on how to use it elegantly. A for loop would take too much time as the aim is to apply this process to very big arrays.
As a simple example, let us consider
mean = [[[4 0]
[1 3]]
[[3 1]
[3 4]]]
std = [[[0.84700368 0.78628226]
[0.54893714 0.68086502]]
[[0.23237688 0.46543749]
[0.01420151 0.25461322]]]
For simplicity, I initialize p as an array containing indices:
p = [[[1 2]
[3 4]]
[[5 6]
[7 8]]]
For instance, I would like to replace value 5 in p by a value randomly drawn from a truncated normal distribution (say truncated between user-chosen values lower and upper) of mean value 3 and standard deviation 0.23237688, as given at corresponding position in mean and std. The aim is to apply this process to all values at once.
Thank you in advance for your answers!
It's easier than you think.
mean = np.array([[[4, 0],
[1, 3]],
[[3, 1],
[3, 4]]])
std = np.array([[[0.84700368, 0.78628226],
[0.54893714, 0.68086502]],
[[0.23237688, 0.46543749],
[0.01420151, 0.25461322]]])
lower = 1
upper = 3
from scipy.stats import truncnorm

# from the documentation of truncnorm:
a, b = (lower - mean) / std, (upper - mean) / std
# remove random_state from parameters if you don't want reproducible
# results.
p = truncnorm.rvs(a, b, loc=mean, scale=std, random_state=1)
print(np.around(p, 2))
# output:
[[[2.6 1.5 ]
[1. 2.3 ]]
[[2.66 1.05]
[2.98 2.94]]]
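Two quick sanity checks you might add (my addition, not part of the original answer): there is one draw per position, and every draw lies within the truncation bounds.

print(p.shape == mean.shape)                # True: one draw per position
print(((lower <= p) & (p <= upper)).all())  # True: all draws within bounds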
I have two numpy.ndarray like this, but with many more rows:
A = numpy.array([[7.087, 0.038, -130.550],
[0.073, 1.224, -13.257]])
B = numpy.array([[20.047, -0.038, -12.551],
[16.073, 1.224, 13.257]])
Each row is one point; each element is its x, y, z coordinate in space.
How can I determine how many points are closer than 2 cm?
I have thought of different ways to solve this problem. One could be creating a sphere with radius = 2 around each point: if a point of A is inside the sphere of a point of B, they are closer than 2 cm.
I think I could write a program to solve this, but I am not sure how to count correctly when one point of A is close to two or more points of B. What I want is the total count, over A and B, of pairs of points that are closer than 2 cm.
Here is a vectorized approach using numpy. It creates an N_a-by-3-by-N_b array that is basically the Cartesian product of all elements of a and b. Then, it uses the numpy.linalg.norm function to compute the Euclidean norm over the axis corresponding to length 3.
import numpy
a = numpy.array([
[7.9, 0.0, -130.6],
[0.1, 1.2, -13.3]
])
b = numpy.array([
[20.0, -0.0, -12.6],
[16.1, 1.2, 13.3],
[ 0.5, 1.5, -12.0],
[ 8.0, 1.0, -131.0],
])
RADIUS = 2.0
d = a[:, :, numpy.newaxis] - b[:, :, numpy.newaxis].T # shape: (2, 3, 4)
close = numpy.linalg.norm(d, ord=2, axis=1) < RADIUS
# Result:
# array([[False, False, False, True],
# [False, False, True, False]])
close is a Boolean array, indexed by the indices of a and b respectively, where close[i][j] indicates whether a[i] and b[j] are close points.
To count the number of close points, you can simply sum over the appropriate axis:
a_close_count = numpy.sum(close, axis=-1)
# Result:
# array([1, 1])
b_close_count = numpy.sum(close, axis=0)
# Result:
# array([0, 0, 1, 1])
If you just need to check whether a point in a or b is close to any point in the other array, you can replace numpy.sum with numpy.any in the expressions above.
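For example:

a_any_close = numpy.any(close, axis=-1)
# Result:
# array([ True,  True])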
If I am right, you want to count the pairs made of a point of A and a point of B that are closer than a given distance.
The easiest solution is by brute force: try all possible pairs in turn (there are #A·#B of them) and increment a counter when the distance is short enough. If #A·#B is reasonable, this is quite acceptable.
If there are many points, I would recommend storing the points of B in a kD-tree (here k=3) for efficient queries. Then for every point of A, perform a "fixed-radius near-neighbor search". This will reduce the running time from #A·#B to roughly #A·(log(#B) + n), where n is the average number of close neighbors of a point.
As the "close to" relation is symmetric, there is no need to search A from B.
I guess I'd go for a scipy distance matrix: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance_matrix.html
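A sketch of what that could look like with the arrays from the question (note the full distance matrix needs memory proportional to #A·#B):

import numpy as np
from scipy.spatial import distance_matrix

A = np.array([[7.087, 0.038, -130.550],
              [0.073, 1.224, -13.257]])
B = np.array([[20.047, -0.038, -12.551],
              [16.073, 1.224, 13.257]])

D = distance_matrix(A, B)           # D[i, j] = distance between A[i] and B[j]
close_pairs = int((D < 2.0).sum())  # number of (a, b) pairs closer than 2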
I'm certain there's a good way to do this but I'm blanking on the right search terms to google, so I'll ask here instead. My problem is this:
I have two 2-dimensional arrays, both with the same dimensions. One array (array 1) is the accumulated precipitation at each (x, y) point. The other (array 2) is the topographic height of the same (x, y) grid. I want to sum up array 1 between specific heights of array 2, and create a bar graph with topographic height bins as the x-axis and total accumulated precipitation on the y-axis.
So I want to be able to declare a list of heights (say [0, 100, 200, ..., 1000]) and for each bin, sum up all precipitation that occurred within that bin.
I can think of a few complicated ways to do this, but I'm guessing there's probably an easier way that I'm not thinking of. My gut instinct is to loop through my list of heights, mask anything outside of that range, sum up the remaining values, add those to a new array, and repeat.
What I'm wondering is whether there's a built-in numpy function, or one from a similar library, that can do this more efficiently.
This code shows what you're asking for, some explanation in comments:
import numpy as np
def in_range(x, lower_bound, upper_bound):
    # returns whether x is between lower_bound (inclusive) and upper_bound (exclusive);
    # note that `x in range(...)` only matches integer values of x
    return x in range(lower_bound, upper_bound)
# vectorize allows you to easily 'map' the function to a numpy array
vin_range = np.vectorize(in_range)
# representing your rainfall
rainfall = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# representing your height map
height = np.array([[1, 2, 1], [2, 4, 2], [3, 6, 3]])
# the bands of height you're looking to sum
bands = [[0, 2], [2, 4], [4, 6], [6, 8]]
# computing the actual results you'd want to chart
result = [(band, sum(rainfall[vin_range(height, *band)])) for band in bands]
print(result)
The next to last line is where the magic happens. vin_range(height, *band) uses the vectorized function to create a numpy array of boolean values, with the same dimensions as height, that has True if a value of height is in the range given, or False otherwise.
By using that array to index the array with the target values (rainfall), you get an array that only has the values for which the height is in the target range. Then it's just a matter of summing those.
In more steps than result = [(band, sum(rainfall[vin_range(height, *band)])) for band in bands] (but with the same result):
result = []
for lower, upper in bands:
    include = vin_range(height, lower, upper)
    values_to_include = rainfall[include]
    sum_of_rainfall = sum(values_to_include)
    result.append(([lower, upper], sum_of_rainfall))
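A small aside on this answer: np.vectorize is essentially a Python-level loop, so for a simple numeric range the same mask can be built with a plain boolean expression, which is usually much faster:

# equivalent mask without np.vectorize
include = (height >= lower) & (height < upper)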
You can use np.bincount together with np.digitize. digitize creates an array of bin indices from the height array height and the bin boundaries bins. bincount then uses the bin indices to sum the data in array rain.
# set up
rain = np.random.randint(0,100,(5,5))/10
height = np.random.randint(0,10000,(5,5))/10
bins = [0,250,500,750,10000]
# compute
sums = np.bincount(np.digitize(height.ravel(), bins), weights=rain.ravel(), minlength=len(bins)+1)
# result
sums
# array([ 0. , 37. , 35.6, 14.6, 22.4, 0. ])
# check against direct method
[rain[(height>=bins[i]) & (height<bins[i+1])].sum() for i in range(len(bins)-1)]
# [37.0, 35.6, 14.600000000000001, 22.4]
An example using the numpy ma module, which allows you to make masked arrays. From the docs:
A masked array is the combination of a standard numpy.ndarray and a mask. A mask is either nomask, indicating that no value of the associated array is invalid, or an array of booleans that determines for each element of the associated array whether the value is valid or not.
which seems to be what you need in this case.
import numpy as np
pr = np.random.randint(0, 1000, size=(100, 100)) #precipitation map
he = np.random.randint(0, 1000, size=(100, 100)) #height map
bins = np.arange(0, 1001, 200)
values = []
for vmin, vmax in zip(bins[:-1], bins[1:]):
    # creating the masked array; here the bin minimum is included, the maximum excluded
    maskedpr = np.ma.masked_where((he < vmin) | (he >= vmax), pr)
    values.append(maskedpr.sum())
values is the list of values for each bin, which you can plot.
The numpy.ma.masked_where function returns an array masked where the condition is True, so you need to set the condition to be True outside the bins.
The sum() method performs the sum only where the array is not masked.
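One possible sketch of the plotting step mentioned above (my addition; bar widths taken from the bin edges):

import matplotlib.pyplot as plt

plt.bar(bins[:-1], values, width=np.diff(bins), align='edge')
plt.xlabel('height bin')
plt.ylabel('total precipitation')
plt.show()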
I want to compute a histogram of the differences between all the elements in one array A with all the elements in another array B.
So I want to have a histogram of the following data:
Delta1 = A1-B1
Delta2 = A1-B2
Delta3 = A1-B3
...
DeltaN = A2-B1
DeltaN+1 = A2-B2
DeltaN+2 = A2-B3
...
The point of this calculation is to show that this data has a correlation, even though not every data point has a "partner" in the other array and the correlation is rather noisy in practice.
The problem is that these files are in practice very large, several GB, and all entries of the vectors are 64-bit integers with very large differences.
It seems unfeasible to me to convert this data to binary arrays in order to be able to use correlation functions and Fourier transforms to compute this.
Here is a small example to give a better taste of what I'm looking at.
This implementation with numpy's searchsorted in a for loop is rather slow.
import numpy as np
import matplotlib.pyplot as plt
timetagsA = [668656283, 974986989, 1294941174, 1364697327, 1478796061,
             1525549542, 1715828978, 2080480431, 2175456303, 2921498771,
             3671218524, 4186901001, 4444689281, 5087334517, 5467644990,
             5836391057, 6249837363, 6368090967, 8344821453, 8933832044,
             9731229532]
timetagsB = [13455, 1294941188, 1715828990, 2921498781, 5087334530,
             5087334733, 6368090978, 9731229545, 9731229800, 9731249954]
max_delta_t = 500
nbins = 10000
histo = np.zeros((nbins, 2), dtype=float)
histo[:, 0] = np.arange(0, nbins)
for i in range(len(timetagsA)):
    delta_t = 0
    j = np.searchsorted(timetagsB, timetagsA[i])
    while np.round(delta_t) < max_delta_t and j < len(timetagsB):
        delta_t = timetagsB[j] - timetagsA[i]
        if delta_t < max_delta_t:
            histo[int(delta_t), 1] += 1
        j = j + 1
plt.plot(histo[0:50,1])
plt.show()
It would be great if someone could help me to find a faster way to compute this. Thanks in advance!
EDIT
The below solution supposes that your data is so huge that you cannot use np.subtract.outer and then slice out the values you want to keep:
arr_diff = np.subtract.outer(arrB, arrA)
print(arr_diff[(0 < arr_diff) & (arr_diff < max_delta_t)])
# array([ 14, 12, 10, 13, 216, 11, 13, 268], dtype=int64)
With your example data this works, but not with a very large data set.
ORIGINAL SOLUTION
Let's first suppose your max_delta_t is smaller than the difference between two successive values in timetagsB, for an easy way of doing it (then we can try to generalize).
#create the array instead of list
arrA = np.array(timetagsA)
arrB = np.array(timetagsB)
max_delta_t = np.diff(arrB).min() - 1 #here it's 202 just for the explanation
You can use np.searchsorted in a vectorized way:
# create the array of search
arr_search = np.searchsorted(arrB, arrA) # the position of each element of arrA in arrB
print (arr_search)
# array([1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 6, 6, 6, 6, 7, 7, 7],dtype=int64)
You can calculate the difference between the element of arrB corresponding to each element of arrA by indexing arrB with arr_search:
# calculate the difference
arr_diff = arrB[arr_search] - arrA
print(arr_diff[arr_diff < max_delta_t])  # find the ones smaller than max_delta_t
# array([14, 12, 10, 13, 11, 13], dtype=int64)
So what you are looking for is then calculated by np.bincount
arr_bins = np.bincount(arr_diff[arr_diff<max_delta_t])
#to make it look like histo but not especially necessary
histo = np.array([range(len(arr_bins)),arr_bins]).T
Now the problem is that there are some difference values between arrA and arrB that cannot be obtained with this method when max_delta_t is bigger than the gap between two successive values in arrB. Here is one way to handle any value of max_delta_t, maybe not the most efficient depending on the values of your data:
#need an array with the number of elements in arrB for each element of arrA
# within a max_delta_t range
arr_diff_search = np.searchsorted(arrB, arrA + max_delta_t)- np.searchsorted(arrB, arrA)
#do a loop to calculate all the values you are interested in
list_arr = []
for i in range(arr_diff_search.max() + 1):
    arr_diff = arrB[(arr_search + i) % len(arrB)][arr_diff_search >= i] - arrA[arr_diff_search >= i]
    list_arr.append(arr_diff[(0 < arr_diff) & (arr_diff < max_delta_t)])
Now you can np.concatenate the list_arr and use np.bincount such as:
arr_bins = np.bincount(np.concatenate(list_arr))
histo = np.array([range(len(arr_bins)),arr_bins]).T
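On data small enough for the outer-difference approach from the EDIT above, you can sanity-check the result; with this example data both methods should produce the same multiset of differences:

# compare against the brute-force outer difference (small data only)
brute = np.subtract.outer(arrB, arrA).ravel()
brute = brute[(0 < brute) & (brute < max_delta_t)]
print(np.array_equal(np.sort(brute), np.sort(np.concatenate(list_arr))))  # True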
While reading up on numpy, I encountered the function numpy.histogram().
What is it for and how does it work? In the docs they mention bins: What are they?
Some googling led me to the definition of Histograms in general. I get that. But unfortunately I can't link this knowledge to the examples given in the docs.
A bin is a range that represents the width of a single bar of the histogram along the X-axis. You could also call this the interval. (Wikipedia defines them more formally as "disjoint categories".)
The Numpy histogram function doesn't draw the histogram, but it computes the occurrences of input data that fall within each bin, which in turn determines the area (not necessarily the height, if the bins aren't of equal width) of each bar.
In this example:
np.histogram([1, 2, 1], bins=[0, 1, 2, 3])
There are 3 bins, for values ranging from 0 to 1 (excluding 1), 1 to 2 (excluding 2) and 2 to 3 (including 3), respectively. The way Numpy defines these bins is by giving a list of delimiters ([0, 1, 2, 3] in this example), although it also returns the bins in the results, since it can choose them automatically from the input if none are specified. If bins=5, for example, it will use 5 bins of equal width spread between the minimum and maximum input values.
The input values are 1, 2 and 1. Therefore, bin "1 to 2" contains two occurrences (the two 1 values), and bin "2 to 3" contains one occurrence (the 2). These results are in the first item in the returned tuple: array([0, 2, 1]).
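For reference, the full return value, with the counts first and the bin edges second:

>>> np.histogram([1, 2, 1], bins=[0, 1, 2, 3])
(array([0, 2, 1]), array([0, 1, 2, 3]))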
Since the bins here are of equal width, you can use the number of occurrences for the height of each bar. When drawn, you would have:
a bar of height 0 for range/bin [0,1) on the X-axis,
a bar of height 2 for range/bin [1,2),
a bar of height 1 for range/bin [2,3].
You can plot this directly with Matplotlib (its hist function also returns the bins and the values):
>>> import matplotlib.pyplot as plt
>>> plt.hist([1, 2, 1], bins=[0, 1, 2, 3])
(array([0, 2, 1]), array([0, 1, 2, 3]), <a list of 3 Patch objects>)
>>> plt.show()
import numpy as np
hist, bin_edges = np.histogram([1, 1, 2, 2, 2, 2, 3], bins = range(5))
Below, hist indicates that there are 0 items in bin #0, 2 in bin #1, 4 in bin #2, and 1 in bin #3.
print(hist)
# array([0, 2, 4, 1])
bin_edges indicates that bin #0 is the interval [0,1), bin #1 is [1,2), ...,
bin #3 is [3,4).
print (bin_edges)
# array([0, 1, 2, 3, 4])
Play with the above code, change the input to np.histogram and see how it works.
But a picture is worth a thousand words:
import matplotlib.pyplot as plt
plt.bar(bin_edges[:-1], hist, width = 1)
plt.xlim(min(bin_edges), max(bin_edges))
plt.show()
Another useful thing to do with numpy.histogram is to plot the output as the x and y coordinates on a linegraph. For example:
import numpy as np
import matplotlib.pyplot as plt

arr = np.random.randint(1, 51, 500)
y, x = np.histogram(arr, bins=np.arange(51))
fig, ax = plt.subplots()
ax.plot(x[:-1], y)
fig.show()
This can be a useful way to visualize histograms where you would like a higher level of granularity without bars everywhere. Very useful in image histograms for identifying extreme pixel values.
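For instance, a minimal sketch of that idea on a synthetic 8-bit grayscale image (the random array is just a stand-in for real pixel data):

import numpy as np
import matplotlib.pyplot as plt

img = np.random.randint(0, 256, (64, 64))              # stand-in for a real grayscale image
y, x = np.histogram(img.ravel(), bins=np.arange(257))  # one bin per pixel value
fig, ax = plt.subplots()
ax.plot(x[:-1], y)  # pixel value on x, count on y
fig.show()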