Filtering histogram edges and counts - python

Consider a histogram calculation of a numpy array that returns percentages:
import numpy as np
# 500 random numbers between 0 and 10,000
values = np.random.uniform(0, 10000, 500)
# Histogram using e.g. 200 buckets
perc, edges = np.histogram(values, bins=200,
                           weights=np.zeros_like(values) + 100/values.size)
The above returns two arrays:
perc containing the % (i.e. percentages) of values within each pair of consecutive edges[ix] and edges[ix+1] out of the total.
edges of length len(perc)+1
Now, say that I want to filter perc and edges so that I only end up with the percentages and edges for values contained within a new range [m, M].
That is, I want to work with the sub-arrays of perc and edges corresponding to the interval of values within [m, M]. Needless to say, the new array of percentages would still refer to the total fraction count of the input array. We just want to filter perc and edges to end up with the correct sub-arrays.
How can I post-process perc and edges to do so?
The values of m and M can be any number of course. In the example above, we can assume e.g. m = 0 and M = 200.

m = 0; M = 200
mask = (m < edges) & (edges < M)
>>> edges[mask]
array([ 37.4789683 , 87.07491593, 136.67086357, 186.2668112 ])
Let's work on a smaller dataset so that it is easier to understand:
np.random.seed(0)
values = np.random.uniform(0, 100, 10)
values.sort()
>>> values
array([ 38.34415188, 42.36547993, 43.75872113, 54.4883183 ,
54.88135039, 60.27633761, 64.58941131, 71.51893664,
89.17730008, 96.36627605])
# Histogram using e.g. 10 buckets
perc, edges = np.histogram(values, bins=10,
weights=np.zeros_like(values) + 100./values.size)
>>> perc
array([ 30., 0., 20., 10., 10., 10., 0., 0., 10., 10.])
>>> edges
array([ 38.34415188, 44.1463643 , 49.94857672, 55.75078913,
61.55300155, 67.35521397, 73.15742638, 78.9596388 ,
84.76185122, 90.56406363, 96.36627605])
m = 0; M = 50
mask = (m <= edges) & (edges < M)
>>> mask
array([ True, True, True, False, False, False, False, False, False,
False, False], dtype=bool)
>>> edges[mask]
array([ 38.34415188, 44.1463643 , 49.94857672])
>>> perc[mask[:-1]][:-1]
array([ 30., 0.])
m = 40; M = 60
mask = (m < edges) & (edges < M)
>>> edges[mask]
array([ 44.1463643 , 49.94857672, 55.75078913])
>>> perc[mask[:-1]][:-1]
array([ 0., 20.])

Well you might need some mathematics for this. The bins are equally spaced so you can determine which bin is the first to include and which is the last by using the width of each bin:
bin_width = edges[1] - edges[0]
Now compute the first and last valid bin:
import math
first = math.floor((m - edges[0]) / bin_width) + 1  # How many bins from the left
last = math.floor((edges[-1] - M) / bin_width) + 1  # How many bins from the right
(Ignore the +1 for both if you want to include the bin containing m or M - but then be careful that you don't end up with negative values for first and last!)
Now you know how many bins to include:
valid_edges = edges[first:-last]
valid_perc = perc[first:-last]
This drops the leading bins counted by first and the trailing bins counted by last.
I may not have paid enough attention to rounding, so there could be an off-by-one error in there, but I think the idea is sound. :-)
You probably need to catch special cases like M > edges[-1] but for readability I haven't included these.
Or if the bins are not equally spaced use boolean masks instead of the calculation:
first = edges[edges < m].size + 1
last = edges[edges > M].size + 1
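Since np.histogram returns sorted edges, the same selection can also be sketched with np.searchsorted instead of counting bins by hand. The helper name filter_histogram is hypothetical, and the sketch assumes that only bins lying entirely inside [m, M] should be kept:

```python
import numpy as np

def filter_histogram(perc, edges, m, M):
    """Keep only the bins whose two edges both lie inside [m, M]."""
    # Index of the first edge that is >= m.
    lo = np.searchsorted(edges, m, side="left")
    # One past the index of the last edge that is <= M.
    hi = np.searchsorted(edges, M, side="right")
    if hi - lo < 2:
        # Fewer than two edges survive, so no complete bin is left.
        return perc[:0], edges[:0]
    # k surviving edges delimit k - 1 bins.
    return perc[lo:hi - 1], edges[lo:hi]

# Small hand-made histogram: 4 bins of width 10.
edges = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
perc = np.array([10.0, 20.0, 30.0, 40.0])
sub_perc, sub_edges = filter_histogram(perc, edges, m=5, M=35)
print(sub_edges)  # edges 10, 20 and 30 remain
print(sub_perc)   # the two bins between them: 20 and 30
```

The percentages are untouched, so they still refer to the full input array, as the question requires.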

Related

How to ignore specific numbers in a numpy moving average?

Let's say I have a simple numpy array:
a = np.array([1,2,3,4,5,6,7])
I can calculate the moving average of a window with size 3 simply like:
np.convolve(a,np.ones(3),'valid') / 3
which would yield
array([2., 3., 4., 5., 6.])
Now, I would like to take a moving average but exclude anytime the number '2' appears. In other words, for the first 3 numbers, originally, it would be (1 + 2 + 3) / 3 = 2. Now, I would like to do (1 + 3) / 2 = 2. How can I specify a user-defined number to ignore and calculate the running mean without including this user-defined number? I would like to keep this to some sort of numpy function without bringing in pandas.
You could replace the unwanted values with 0 using a mask and separately compute the number of valid items, then compute the ratio:
a = np.array([1,2,3,4,5,6,7])
mask = a != 2
num = np.convolve(np.where(mask, a, 0), np.ones(3), 'valid')
denom = np.convolve(mask, np.ones(3), 'valid')
out = num/denom
Output:
array([2. , 3.5, 4. , 5. , 6. ])
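The same idea can be wrapped in a small function. This is only a sketch: the name masked_moving_average is made up for illustration, and it assumes that a window containing nothing but ignored values should yield NaN rather than a division warning:

```python
import numpy as np

def masked_moving_average(a, window, ignore):
    """Moving average over `window` items that skips entries equal to `ignore`."""
    a = np.asarray(a, dtype=float)
    valid = a != ignore
    kernel = np.ones(window)
    # Sum of the valid entries in each window (ignored entries count as 0).
    num = np.convolve(np.where(valid, a, 0.0), kernel, "valid")
    # Number of valid entries in each window.
    count = np.convolve(valid.astype(float), kernel, "valid")
    # Divide only where at least one valid entry exists; leave NaN elsewhere.
    out = np.full(num.shape, np.nan)
    np.divide(num, count, out=out, where=count > 0)
    return out

result = masked_moving_average([1, 2, 3, 4, 5, 6, 7], window=3, ignore=2)
print(result)  # values 2, 3.5, 4, 5, 6

all_ignored = masked_moving_average([2, 2, 2, 1], window=3, ignore=2)
print(all_ignored)  # first window has no valid entry, so it is NaN
```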

How to find the region of a number in a 1D array

I have no idea how to search for this question so apologies if this is a duplicate.
Anyway: I have a series of breakpoints in 1D. Let's say those breakpoints are [-1, 0, 1]. This splits the 1D space into 4 regions: x < -1, -1 <= x < 0, 0 <= x < 1, x >= 1. What I want is, given some value of x, I want to find which region it would fall in (let's say as a symbol in the alphabet).
While nested if-thens would work when there are few breakpoints, this would be cumbersome if I have many. Is there any simpler way that will work for any number of breakpoints? Numpy should have something...
Yes, we can vectorize this using numpy. The trick to find the bin of a value is to take the delta of that value with the boundary array to get a delta array, then check for the index i where this delta array is nonnegative at i but negative at i+1. More specifically, it can be done as follows
boundaries = np.array([-np.inf, -1., 0., 1., np.inf]) # put breakpoints between -np.inf and np.inf
values = np.array([-1., -0.25, 0.5, 1.]) # values whose bin you want to search
delta = values[:,None] - boundaries[None,:]
mask = (delta[:, :-1] >= 0) & (delta[:, 1:] < 0)
Running this gives the mask as
array([[False, True, False, False],
[False, True, False, False],
[False, False, True, False],
[False, False, False, True]])
where the j-th row contains exactly one True element, marking the bin that the j-th value belongs to.
To get the concrete bin boundaries simply do
left_boundary_index = np.where(mask)[1]
np.stack([boundaries[left_boundary_index], boundaries[left_boundary_index + 1]], axis=-1)
which gives
array([[-1., 0.],
[-1., 0.],
[ 0., 1.],
[ 1., inf]])
I think you need to use binary search here:
def find_range(points,x):
    points.sort()
    low = 0
    high = len(points) - 1
    if x>points[high]:
        return [points[high],'Inf']
    if x<points[low]:
        return ['-Inf',points[low]]
    while high-low > 1:
        mid = (low + high)//2
        midVal = points[mid]
        if midVal < x:
            low = mid
        elif midVal >= x:
            high = mid
    return [points[low],points[high]]
points=[-1,0,1,3,5]
print(find_range(points,6))
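NumPy also ships a vectorized binary search, np.searchsorted, which may cover this case directly. A minimal sketch, assuming the breakpoints are sorted and that regions are closed on the left as in the question (-1 <= x < 0 and so on); the letter labels are just an illustration:

```python
import string

import numpy as np

breakpoints = np.array([-1.0, 0.0, 1.0])
x = np.array([-2.0, -1.0, -0.25, 0.5, 1.0])

# side="right" counts the breakpoints that are <= x, which is exactly the
# index of the region when each region includes its left boundary.
region = np.searchsorted(breakpoints, x, side="right")
print(region)  # region indices 0, 1, 1, 2, 3

# Map region indices to letters of the alphabet.
letters = np.array(list(string.ascii_lowercase))
labels = letters[region]
print(labels)  # a, b, b, c, d
```

With N breakpoints this yields indices from 0 to N, so no explicit -inf and inf sentinels are needed.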

How to do element-wise rounding of NumPy array to first non-zero digit?

I would like to "round" (not exactly a mathematical rounding) the elements of a numpy array in the following way:
Given a numpy NxN or NxM 2D array with values between 0.00001 and 9.99999 like
a=np.array([[1.232, 1.872,2.732,0.123],
[0.0019, 0.025, 1.854, 0.00017],
[1.457, 0.0021, 2.34 , 9.99],
[1.527, 3.3, 0.012 , 0.005]]
)
I would basically like to "round" this numpy array by keeping only the first non-zero digit of each element (regardless of the digits that follow it),
giving the output:
output =np.array([[1.0, 1.0, 2.0, 0.1],
[0.001, 0.02, 1.0, 0.0001],
[1.0, 0.002, 2 , 9.0],
[1, 3, 0.01 , 0.005]]
)
thanks for any help!
You could use np.logspace and np.searchsorted to determine the order of magnitude of each element, then floor-divide and multiply back:
po10 = np.logspace(-10,10,21)
oom = po10[po10.searchsorted(a)-1]
a//oom*oom
# array([[1.e+00, 1.e+00, 2.e+00, 1.e-01],
# [1.e-03, 2.e-02, 1.e+00, 1.e-04],
# [1.e+00, 2.e-03, 2.e+00, 9.e+00],
# [1.e+00, 3.e+00, 1.e-02, 5.e-03]])
What you would want to do is to keep a fixed number of significant figures.
This functionality is not integrated into NumPy.
To get only 1 significant figure, you could look into either @PaulPanzer's or @darcamo's answers (assuming that you only have positive values).
If you want something that works for a specified number of significant figures, you could use something like:
def significant_figures(arr, num=1):
    # : compute the order of magnitude
    order = np.zeros_like(arr)
    mask = arr != 0
    order[mask] = np.floor(np.log10(np.abs(arr[mask])))
    del mask  # free unused memory
    # : compute the corresponding precision
    prec = num - order - 1
    return np.round(arr * 10.0 ** prec) / 10.0 ** prec
print(significant_figures(a, 1))
# [[1.e+00 2.e+00 3.e+00 1.e-01]
# [2.e-03 2.e-02 2.e+00 2.e-04]
# [1.e+00 2.e-03 2.e+00 1.e+01]
# [2.e+00 3.e+00 1.e-02 5.e-03]]
print(significant_figures(a, 2))
# [[1.2e+00 1.9e+00 2.7e+00 1.2e-01]
# [1.9e-03 2.5e-02 1.9e+00 1.7e-04]
# [1.5e+00 2.1e-03 2.3e+00 1.0e+01]
# [1.5e+00 3.3e+00 1.2e-02 5.0e-03]]
EDIT
For truncated output use np.floor() instead of np.round() just before the return.
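A sketch of that truncating variant, written out in full. The function name truncate_significant is hypothetical, and the sketch assumes non-negative inputs, because np.floor moves negative numbers away from zero:

```python
import numpy as np

def truncate_significant(arr, num=1):
    """Truncate (rather than round) to `num` significant figures."""
    arr = np.asarray(arr, dtype=float)
    # Order of magnitude of each non-zero element; zeros keep order 0.
    order = np.zeros_like(arr)
    mask = arr != 0
    order[mask] = np.floor(np.log10(np.abs(arr[mask])))
    # Scale so that the wanted digits sit left of the decimal point.
    scale = 10.0 ** (num - order - 1)
    return np.floor(arr * scale) / scale

sample = np.array([1.872, 0.0019, 0.025, 9.99])
truncated = truncate_significant(sample, 1)
print(truncated)  # values 1, 0.001, 0.02 and 9
```

Because the scaling happens in floating point, inputs that sit exactly on a digit boundary can occasionally land one step lower than expected.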
First get the powers of 10 for each number in the array with
powers = np.floor(np.log10(a))
In your example this gives us
array([[ 0., 0., 0., -1.],
[-3., -2., 0., -4.],
[ 0., -3., 0., 0.],
[ 0., 0., -2., -3.]])
Now, if we divide the i-th element in the array by 10**power_i, we essentially move the first non-zero digit of each element to the units position. We can then simply take the floor to remove the remaining digits and multiply the result by 10**power_i to get back to the original scale.
The complete solution is then only the code below
powers = np.floor(np.log10(a))
10**powers * np.floor(a/10**powers)
What about numbers greater than or equal to 10?
For this you can simply take np.floor of the original value in the array. We can do this easily with a mask. You can modify the answer as below
powers = np.floor(np.log10(a))
result = 10**powers * np.floor(a/10**powers)
mask = a >= 10
result[mask] = np.floor(a[mask])
You can also use a mask to avoid computing the powers and logarithm for numbers that will just be replaced later.
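A minimal sketch of that masked variant, assuming strictly positive inputs. Values of 10 or more get a plain floor, following this answer's convention, and the logarithm is only evaluated for the remaining elements:

```python
import numpy as np

a = np.array([0.0123, 4.56, 12.7, 250.9])

# Start from the plain floor, which is the final result for values >= 10.
result = np.floor(a)

# Only elements below 10 need the order-of-magnitude treatment.
small = a < 10
powers = np.floor(np.log10(a[small]))
result[small] = 10.0**powers * np.floor(a[small] / 10.0**powers)

print(result)  # values 0.01, 4, 12 and 250
```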

Calculating wind divergence of u and v using Python, np.gradient

I'm very new to Python and currently trying to replicate plots etc that I previously used GrADs for. I want to calculate the divergence at each grid box using u and v wind fields (which are just scaled by specific humidity, q), from a netCDF climate model file.
From endless searching I know I need to use some combination of np.gradient and np.sum, but can't find the right combination. I just know that to do it 'by hand', the calculation would be
divg = dqu/dx + dqv/dy
I know the below is wrong, but it's the best I've got so far...
nc = Dataset(ifile)
q = np.array(nc.variables['hus'][0,:,:])
u = np.array(nc.variables['ua'][0,:,:])
v = np.array(nc.variables['va'][0,:,:])
lon=nc.variables['lon'][:]
lat=nc.variables['lat'][:]
qu = q*u
qv = q*v
dqu/dx, dqu/dy = np.gradient(qu, [dx, dy])
dqv/dx, dqv/dy = np.gradient(qv, [dx, dy])
divg = np.sum(dqu/dx, dqv/dy)
This gives the error 'SyntaxError: can't assign to operator'.
Any help would be much appreciated.
try something like:
dqu_dx, dqu_dy = np.gradient(qu, [dx, dy])
dqv_dx, dqv_dy = np.gradient(qv, [dx, dy])
You cannot assign to an expression in Python; all of the following are syntax errors:
a + b = 3
a * b = 7
# or, in your case:
a / b = 9
UPDATE
following Pinetwig's comment: a/b is not a valid identifier name; it is (the return value of) an operator.
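Once the names are valid, the divergence itself is just the sum of two of the partial derivatives. The sketch below makes several assumptions that are not in the question: the arrays are indexed [lat, lon], so y runs along axis 0 and x along axis 1; the grid spacings dx and dy are constant and already in physical units; and the helper name divergence is made up. On a real latitude-longitude grid the spacing in x shrinks towards the poles, which this sketch ignores.

```python
import numpy as np

def divergence(qu, qv, dx, dy):
    """d(qu)/dx + d(qv)/dy for 2D fields indexed [y, x] on a uniform grid."""
    dqu_dx = np.gradient(qu, dx, axis=1)  # derivative along x (columns)
    dqv_dy = np.gradient(qv, dy, axis=0)  # derivative along y (rows)
    return dqu_dx + dqv_dy

# Synthetic test field with a known answer: qu = 3x, qv = 2y, so div = 5.
dx, dy = 0.5, 2.0
x = np.arange(5) * dx
y = np.arange(4) * dy
X, Y = np.meshgrid(x, y)  # both have shape (4, 5), i.e. [y, x]
qu = 3.0 * X
qv = 2.0 * Y

divg = divergence(qu, qv, dx, dy)
print(divg.shape)  # (4, 5)
print(divg[0, 0])  # 5.0
```

Note that np.sum(a, b) would not add two arrays; its second argument is the axis, so the plain + operator is used instead.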
Try removing the [dx, dy].
[dqu_dx, dqu_dy] = np.gradient(qu)
[dqv_dx, dqv_dy] = np.gradient(qv)
It is also worth pointing out, if you are recreating plots, that np.gradient changed in numpy between 1.82 and 1.9. This affected recreating MATLAB plots in Python, as the 1.82 behaviour matched the MATLAB method. I am not sure how this relates to GrADs. Here is the wording for both.
1.82
"The gradient is computed using central differences in the interior
and first differences at the boundaries. The returned gradient hence has
the same shape as the input array."
1.9
"The gradient is computed using second order accurate central differences in the interior and either first differences or second order accurate one-sides (forward or backwards) differences at the boundaries. The returned gradient hence has the same shape as the input array."
The gradient function for 1.82 is here.
def gradient(f, *varargs):
    """
    Return the gradient of an N-dimensional array.

    The gradient is computed using central differences in the interior
    and first differences at the boundaries. The returned gradient hence has
    the same shape as the input array.

    Parameters
    ----------
    f : array_like
        An N-dimensional array containing samples of a scalar function.
    `*varargs` : scalars
        0, 1, or N scalars specifying the sample distances in each direction,
        that is: `dx`, `dy`, `dz`, ... The default distance is 1.

    Returns
    -------
    gradient : ndarray
        N arrays of the same shape as `f` giving the derivative of `f` with
        respect to each dimension.

    Examples
    --------
    >>> x = np.array([1, 2, 4, 7, 11, 16], dtype=np.float)
    >>> np.gradient(x)
    array([ 1. , 1.5, 2.5, 3.5, 4.5, 5. ])
    >>> np.gradient(x, 2)
    array([ 0.5 , 0.75, 1.25, 1.75, 2.25, 2.5 ])
    >>> np.gradient(np.array([[1, 2, 6], [3, 4, 5]], dtype=np.float))
    [array([[ 2., 2., -1.],
            [ 2., 2., -1.]]),
     array([[ 1. , 2.5, 4. ],
            [ 1. , 1. , 1. ]])]
    """
    f = np.asanyarray(f)
    N = len(f.shape)  # number of dimensions
    n = len(varargs)
    if n == 0:
        dx = [1.0]*N
    elif n == 1:
        dx = [varargs[0]]*N
    elif n == N:
        dx = list(varargs)
    else:
        raise SyntaxError(
            "invalid number of arguments")

    # use central differences on interior and first differences on endpoints
    outvals = []

    # create slice objects --- initially all are [:, :, ..., :]
    slice1 = [slice(None)]*N
    slice2 = [slice(None)]*N
    slice3 = [slice(None)]*N

    otype = f.dtype.char
    if otype not in ['f', 'd', 'F', 'D', 'm', 'M']:
        otype = 'd'

    # Difference of datetime64 elements results in timedelta64
    if otype == 'M' :
        # Need to use the full dtype name because it contains unit information
        otype = f.dtype.name.replace('datetime', 'timedelta')
    elif otype == 'm' :
        # Needs to keep the specific units, can't be a general unit
        otype = f.dtype

    for axis in range(N):
        # select out appropriate parts for this dimension
        out = np.empty_like(f, dtype=otype)
        slice1[axis] = slice(1, -1)
        slice2[axis] = slice(2, None)
        slice3[axis] = slice(None, -2)
        # 1D equivalent -- out[1:-1] = (f[2:] - f[:-2])/2.0
        out[slice1] = (f[slice2] - f[slice3])/2.0

        slice1[axis] = 0
        slice2[axis] = 1
        slice3[axis] = 0
        # 1D equivalent -- out[0] = (f[1] - f[0])
        out[slice1] = (f[slice2] - f[slice3])

        slice1[axis] = -1
        slice2[axis] = -1
        slice3[axis] = -2
        # 1D equivalent -- out[-1] = (f[-1] - f[-2])
        out[slice1] = (f[slice2] - f[slice3])

        # divide by step size
        outvals.append(out / dx[axis])

        # reset the slice object in this dimension to ":"
        slice1[axis] = slice(None)
        slice2[axis] = slice(None)
        slice3[axis] = slice(None)

    if N == 1:
        return outvals[0]
    else:
        return outvals
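In current NumPy releases the boundary scheme can be chosen through the edge_order argument of np.gradient, so the two behaviours quoted above can be compared side by side. A small sketch using the sample array from the docstring:

```python
import numpy as np

x = np.array([1, 2, 4, 7, 11, 16], dtype=float)

# First-order one-sided differences at the two boundaries.
first_order = np.gradient(x, edge_order=1)
# Second-order accurate one-sided differences at the two boundaries.
second_order = np.gradient(x, edge_order=2)

print(first_order)   # 1, 1.5, 2.5, 3.5, 4.5, 5
print(second_order)  # 0.5, 1.5, 2.5, 3.5, 4.5, 5.5
```

Only the first and last entries differ; the interior always uses central differences.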
If your grid is Gaussian and the wind names in the file are "u" and "v" you can also calculate divergence directly using cdo:
cdo uv2dv in.nc out.nc
See https://code.mpimet.mpg.de/projects/cdo/embedded/index.html#x1-6850002.13.2 for more details.

linspace that would always include the final point?

For an arbitrary pair of 2D points in the plane, I want to break the connecting vector into parts specified by a precision factor. However, I want it to always include the start and end points. As an extra feature, I expect that segmenting from the end of the vector to the beginning gives the same segmentation as from beginning to end (after flipping, of course). As far as I can see, numpy.linspace naturally satisfies this condition, except when the precision is so large that the result consists of only one point. Is there a built-in function that takes care of this situation, or any hint on how I could correct this behaviour?
import numpy as np
alpha = np.array([0,0])
beta = np.array([1,1])
alpha_beta_dist = np.linalg.norm(beta - alpha)
for i in range(10):
    precision = np.random.random(1)
    # linspace needs an integer number of points
    num = int(alpha_beta_dist / float(precision))
    traversal = np.linspace(0.0, 1.0, num=num)
    traversal2 = np.fliplr([np.linspace(1.0, 0.0, num=num)])
    traversal2 = traversal2[0]
    if (traversal != traversal2).all():
        print('precision: ', precision)
        print('traversal: ', traversal)
        print('traversal2: ', traversal2)
Make sure num is at least 2:
traversal = np.linspace(0.0, 1.0,
                        num=max(int(alpha_beta_dist / float(precision)), 2))
np.linspace will return both endpoints (by default) unless num is less than 2:
In [23]: np.linspace(0, 1, num=0)
Out[23]: array([], dtype=float64)
In [24]: np.linspace(0, 1, num=1)
Out[24]: array([ 0.])
In [25]: np.linspace(0, 1, num=2)
Out[25]: array([ 0., 1.])
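Putting this together, here is a sketch of a helper that always returns both endpoints. The name segment is hypothetical, and the point count int(dist / precision) simply mirrors the question's own choice:

```python
import numpy as np

def segment(alpha, beta, precision):
    """Points from alpha to beta, always including both endpoints."""
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    dist = np.linalg.norm(beta - alpha)
    # Never ask linspace for fewer than 2 points.
    num = max(int(dist / precision), 2)
    t = np.linspace(0.0, 1.0, num=num)
    # Broadcast t over the coordinates to interpolate along the vector.
    return alpha + t[:, None] * (beta - alpha)

# A precision larger than the distance still yields start and end.
forward = segment([0, 0], [1, 1], precision=5.0)
backward = segment([1, 1], [0, 0], precision=5.0)
print(forward)         # rows [0, 0] and [1, 1]
print(backward[::-1])  # the same rows after flipping
```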
