How to find the region of a number in a 1D array - python

I have no idea how to search for this question so apologies if this is a duplicate.
Anyway: I have a series of breakpoints in 1D. Let's say those breakpoints are [-1, 0, 1]. This splits the 1D space into 4 regions: x < -1, -1 <= x < 0, 0 <= x < 1, x >= 1. What I want is, given some value of x, I want to find which region it would fall in (let's say as a symbol in the alphabet).
While nested if-thens would work when there are few breakpoints, this would be cumbersome if I have many. Is there any simpler way that will work for any number of breakpoints? Numpy should have something...

Yes, we can vectorize this using numpy. The trick to finding the bin of a value is to take the delta of that value with the boundary array, then check for the index i where this delta array is nonnegative at i but negative at i+1. More specifically, it can be done as follows:
import numpy as np

boundaries = np.array([-np.inf, -1., 0., 1., np.inf])  # put breakpoints between -np.inf and np.inf
values = np.array([-1., -0.25, 0.5, 1.])  # values whose bin you want to search
delta = values[:, None] - boundaries[None, :]
mask = (delta[:, :-1] >= 0) & (delta[:, 1:] < 0)
Running this gives the mask as
array([[False,  True, False, False],
       [False,  True, False, False],
       [False, False,  True, False],
       [False, False, False,  True]])
where the j-th row contains exactly one True element, marking the bin that the j-th value belongs to.
To get the concrete bin boundaries, simply do
left_boundary_index = np.where(mask)[1]
np.stack([boundaries[left_boundary_index], boundaries[left_boundary_index + 1]], axis=-1)
which gives
array([[-1.,  0.],
       [-1.,  0.],
       [ 0.,  1.],
       [ 1., inf]])
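For what it's worth, NumPy also ships a routine that does this lookup directly: np.searchsorted binary-searches each value against the sorted breakpoints and returns the region index in one call. A minimal sketch (the letter labels are just an illustration, not part of the question):

import numpy as np

breakpoints = np.array([-1., 0., 1.])
values = np.array([-1., -0.25, 0.5, 1.])

# side='right' puts a value equal to a breakpoint into the region on its
# right, matching the half-open intervals [-1, 0), [0, 1), ...
region = np.searchsorted(breakpoints, values, side='right')
labels = np.array(list("ABCD"))  # one symbol per region

print(region)          # [1 1 2 3]
print(labels[region])  # ['B' 'B' 'C' 'D']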

I think you need to use binary search:
def find_range(points, x):
    points.sort()
    low = 0
    high = len(points) - 1
    if x > points[high]:
        return [points[high], 'Inf']
    if x < points[low]:
        return ['-Inf', points[low]]
    while high - low > 1:
        mid = (low + high) // 2
        midVal = points[mid]
        if midVal < x:
            low = mid
        elif midVal >= x:
            high = mid
    return [points[low], points[high]]

points = [-1, 0, 1, 3, 5]
print(find_range(points, 6))
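As an aside, the standard library's bisect module implements the same binary search; a minimal sketch that matches the function above for values strictly between breakpoints (ties at a breakpoint close the interval to their left, as in the while loop):

import bisect

def find_range_bisect(points, x):
    i = bisect.bisect_left(points, x)  # index of the first breakpoint >= x
    left = points[i - 1] if i > 0 else '-Inf'
    right = points[i] if i < len(points) else 'Inf'
    return [left, right]

points = [-1, 0, 1, 3, 5]
print(find_range_bisect(points, 6))    # [5, 'Inf']
print(find_range_bisect(points, 0.5))  # [0, 1]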

Related

Calculate mean value for each pixel of a sum of Xarray DataArrays

I am trying to calculate a fog frequency map based on a number of geoTIFFs that I have read as Xarray DataArrays using the rioxarray.open_rasterio function in Python 3.10. Each "pixel" can have one of the following values: 1 = fog, 0 = no fog, -9999 = no data. The end goal is to calculate a new DataArray that contains the ratio of the number of "fog" pixels to the number of pixels with either "fog" or "no fog".
For this I want to write a for-loop that creates the sum of "fog" and "no_fog" entries per pixel while excluding the "no data" pixels. Then it should divide the pixel values of the sum DataArray by the number of pixels that were used in the calculation of each individual sum. So, if for a single pixel there are the following values: 0, 1, 1, -9999, 0, and -9999, the loop should create a sum of 2 and divide it by 4, creating a fog frequency of 0.5 or 50%.
So far, I have only been able to calculate the sum of all input DataArrays, without excluding the "no data" pixels using this code:
# open all fog maps and create a list:
folder = "E:/Jasper/Studium/BA_Thesis/MODIS_data/MODIS_2021_data/2021_06/fog_frequency"
list_of_maps = glob.glob(folder + '/fog_map*.tif', recursive=True)  # all files that start with "fog_map"

# make list with all different filenames (dates) in this folder:
maps = []  # initialize empty list for all file names
for i in range(0, np.size(list_of_maps)):
    # files naming convention "fog_map_YYYYMMDD_HHMMSS.tif":
    maps.append(list_of_maps[i].split('fog_map_')[1][0:8])

# find out how many dates are in the folder:
maps = np.unique(maps)  # remove duplicates from array
print(maps)
print('\ndata from {} different dates in this folder\n'.format(np.size(maps)))

# create fog_sum xarray DataArray to have something to start out with and later subtract it again:
fog_sum = rioxarray.open_rasterio("E:/Jasper/Studium/BA_Thesis/MODIS_data/MODIS_2021_data/2021_06/fog_frequency/fog_map_20210601.tif")
fog_sum_subtract = rioxarray.open_rasterio("E:/Jasper/Studium/BA_Thesis/MODIS_data/MODIS_2021_data/2021_06/fog_frequency/fog_map_20210601.tif")

# add all fog maps:
for i in range(0, np.size(list_of_maps)):
    # open data sets:
    fog_map = rioxarray.open_rasterio(list_of_maps[i], engine='rasterio')
    # fog_map = fog_map.where(fog_map >= 0)
    fog_sum = fog_sum + fog_map

# subtract original fog map and export as geoTIFF:
fog_sum = fog_sum - fog_sum_subtract
fog_sum.rio.to_raster("E:/Jasper/Studium/BA_Thesis/MODIS_data/MODIS_2021_data/2021_06/fog_frequency/fog_sum.tif",
                      driver="GTiff")
I tried to exclude the "no data" values using fog_map = fog_map.where(fog_map >= 0), but this left me with a fog_sum GeoTIFF where each pixel had the value 1.79769e+308.
This is an example of what the output of a fog_map_YYYYMMDD.tif DataArray looks like, before applying fog_map.where(fog_map >= 0):
<xarray.DataArray (band: 1, y: 412, x: 388)>
[159856 values with dtype=float32]
Coordinates:
  * band         (band) int32 1
  * x            (x) float64 -92.49 -92.49 -92.48 ... -89.02 -89.02 -89.01
  * y            (y) float64 2.0 1.991 1.982 1.973 ... -1.674 -1.683 -1.692
    spatial_ref  int32 0
Attributes:
    STATISTICS_MAXIMUM:        1
    STATISTICS_MEAN:           -858.62379891903
    STATISTICS_MINIMUM:        -9999
    STATISTICS_STDDEV:         2801.4551987932
    STATISTICS_VALID_PERCENT:  100
    scale_factor:              1.0
    add_offset:                0.0
And after applying the function:
<xarray.DataArray (band: 1, y: 412, x: 388)>
array([[[ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0., nan,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        ...,
        [ 0.,  0.,  0., ..., nan, nan, nan],
        [ 0.,  0.,  0., ..., nan, nan, nan],
        [ 0.,  0.,  0., ..., nan, nan, nan]]], dtype=float32)
Coordinates:
  * band         (band) int32 1
  * x            (x) float64 -92.49 -92.49 -92.48 ... -89.02 -89.02 -89.01
  * y            (y) float64 2.0 1.991 1.982 1.973 ... -1.674 -1.683 -1.692
    spatial_ref  int32 0
Attributes:
    STATISTICS_MAXIMUM:        1
    STATISTICS_MEAN:           -858.62379891903
    STATISTICS_MINIMUM:        -9999
    STATISTICS_STDDEV:         2801.4551987932
    STATISTICS_VALID_PERCENT:  100
    scale_factor:              1.0
    add_offset:                0.0
Any help is greatly appreciated!
If your data is this small, it's probably fastest and easiest to concat all the data, drop the -9999 values using the da.where() code you suggest, then just take the mean over the concatenated dimension:
fog_maps = []
for i in range(0, np.size(list_of_maps)):
    # open data sets:
    fog_map = rioxarray.open_rasterio(list_of_maps[i], engine='rasterio')
    fog_maps.append(fog_map)

all_fog = xr.concat(fog_maps, dim="file")
all_fog = all_fog.where(all_fog >= 0)
mean_fog = all_fog.mean(dim="file")
Your data is already float (though maybe it shouldn't be?), so replacing -9999 with np.nan (a float) isn't hurting you.
But if you're facing memory constraints and want to accumulate the stats iteratively, I'd use xr.where(da >= 0, 1, 0) to accumulate your denominator:
# add all fog maps:
data_count = 0
fog_sum = 0
for i in range(0, np.size(list_of_maps)):
    # open data sets:
    fog_map = rioxarray.open_rasterio(list_of_maps[i], engine='rasterio')
    data_count += xr.where(fog_map >= 0, 1, 0)  # 1 where valid, 0 where nodata
    fog_sum += fog_map.where(fog_map > 0, 0)    # fog hits; nodata and no-fog add 0

fog_mean = (fog_sum / data_count).where(data_count > 0)
xr.where is like xr.DataArray.where but you define the value returned if true explicitly.
As an aside, the value you were seeing, 1.797e308, is the largest value that can be held by a float64. So you’re probably just looking at the GeoTiff encoding of np.inf after a divide by zero problem. Make sure to mask the data where your denominator is zero and handle it the way you’re intending for locations where there is never any valid data.
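One way to follow that advice on export is to fill the never-valid pixels with an explicit nodata value before writing, so the GeoTIFF never has to encode NaN. A sketch, assuming you want to reuse -9999 as the nodata convention of the inputs (the output filename is just an example):

# pixels with no valid observations are NaN after the masked division;
# give them an explicit nodata value before writing the GeoTIFF
fog_mean.fillna(-9999.0).rio.to_raster(folder + "/fog_frequency.tif", driver="GTiff")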

Conditional Loop with numpy arrays

I am trying to implement the following simple condition with numpy arrays, but the output is wrong.
dt = 1.0
t = np.arange(0.0, 5.0, dt)
x = np.empty_like(t)
if np.where((t >= 0) & (t < 3)):
    x = 2*t
else:
    x = 4*t
I get the output below
array([0., 2., 4., 6., 8.])
But I am expecting
array([0., 2., 4., 12., 16.])
Thanks for your help!
Looking in the docs for np.where:
Note: When only condition is provided, this function is a shorthand for np.asarray(condition).nonzero(). Using nonzero directly should be preferred, as it behaves correctly for subclasses. The rest of this documentation covers only the case where all three arguments are provided.
Since you don't provide the x and y arguments, where acts like nonzero.
nonzero returns a tuple of arrays, and a non-empty tuple is always truthy when converted to bool. So your code ends up evaluating as:
if True:
    x = 2*t
Instead, you want to use:
x = np.where((t >= 0) & (t < 3), 2*t, 4*t)
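If you ever need more than two branches, np.select generalizes the same idea to a list of conditions and choices; a minimal sketch on the question's data:

import numpy as np

t = np.arange(0.0, 5.0, 1.0)
# conditions are checked in order; the first one that matches wins
x = np.select([t < 3, t >= 3], [2*t, 4*t])
print(x)  # [ 0.  2.  4. 12. 16.]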
The usage of np.where is different:
dt = 1.0
t = np.arange(0.0, 5.0, dt)
x = np.empty_like(t)
x = np.where((t >= 0) & (t < 3), 2*t, 4*t)
x
Output
[ 0., 2., 4., 12., 16.]
In your code the if statement is not necessary and causes the problem.
np.where() applies the condition itself, so you do not need the if statement.
Here is a working example of your code with the output you want:
dt = 1.0
t = np.arange(0.0, 5.0, dt)
x = np.empty_like(t)
x = np.where((t >= 0) & (t < 3), 2*t, 4*t)

sum a 3x3 array on a given point to another matrix maintaining boundaries

Suppose I have this 2D array A:
[[0, 0, 0, 0],
 [0, 0, 0, 0],
 [0, 0, 0, 0],
 [0, 0, 0, 4]]
and I want to sum B:
[[1, 2, 3],
 [4, 5, 6],
 [7, 8, 9]]
centered on A[0][0], wrapping around the boundaries, so the result would be:
array_sum(A, B, 0, 0) =
[[5, 6, 0, 4],
 [8, 9, 0, 7],
 [0, 0, 0, 0],
 [2, 3, 0, 5]]
I was thinking that I should make a function that checks if it's on a boundary and then adjusts the indices for that:
def array_sum(A, B, i, j):
    ...
    if i == 0 and j == 0:
        A[-1][-1] = A[-1][-1] + B[0][0]
        ...
    else:
        A[i-1][j-1] = A[i-1][j-1] + B[0][0]
        A[i][j] = A[i][j] + B[1][1]
        A[i+1][j+1] = A[i+1][j+1] + B[2][2]
        ...
but I don't know if there is a better way of doing that. I was reading about broadcasting, or maybe using convolution, but I'm not sure whether there is a better approach.
Assuming B.shape is all odd numbers, you can use np.indices, shift the indices to point where you want, and use np.add.at:
def array_sum(A, B, loc=(0, 0)):
    A_ = A.copy()
    ix = np.indices(B.shape)
    new_loc = np.array(loc) - np.array(B.shape) // 2
    new_ix = np.mod(ix + new_loc[:, None, None],
                    np.array(A.shape)[:, None, None])
    np.add.at(A_, tuple(new_ix), B)
    return A_
Testing:
array_sum(A, B)
Out:
array([[ 5.,  6.,  0.,  4.],
       [ 8.,  9.,  0.,  7.],
       [ 0.,  0.,  0.,  0.],
       [ 2.,  3.,  0.,  5.]])
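A note on the design choice: np.add.at is used instead of a plain fancy-indexed +=, because += is buffered and counts a repeated index only once, while np.add.at accumulates every contribution. That matters whenever new_ix contains duplicates (for instance if a dimension of B were larger than the matching dimension of A, so the patch wraps onto itself). A small demo:

import numpy as np

idx = np.array([0, 0, 1])

a = np.zeros(3)
a[idx] += 1            # buffered: the duplicated index 0 is counted once
print(a)               # [1. 1. 0.]

b = np.zeros(3)
np.add.at(b, idx, 1)   # unbuffered: duplicates accumulate
print(b)               # [2. 1. 0.]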
As a rule of thumb, slice indexing is faster (~2x) than fancy indexing. This appears to be true even for the small example in the OP. Downside: the code is slightly more complicated.
import numpy as np
from numpy import s_ as _
from itertools import product, starmap

def wrapsl1d(N, n, c):
    # check in 1D whether a patch of size n centered at c in a vector
    # of length N fits or has to be wrapped around;
    # return appropriate slice objects for both vector and patch
    assert n <= N
    l = (c - n//2) % N
    h = l + n
    # return list of pairs (index into A, index into patch):
    # 2 pairs if we wrap around, otherwise 1 pair
    return [_[l:h, :]] if h <= N else [_[l:, :N-l], _[:h-N, n+N-h:]]

def use_slices(A, patch, center=(0, 0)):
    slAptch = product(*map(wrapsl1d, A.shape, patch.shape, center))
    # the product now has elements [(idx0A, idx0ptch), (idx1A, idx1ptch)];
    # transpose them:
    slAptch = starmap(zip, slAptch)
    out = A.copy()
    for sa, sp in slAptch:
        out[sa] += patch[sp]
    return out
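A quick check that this reproduces the result above, using the arrays from the question:

A = np.zeros((4, 4))
A[3, 3] = 4
B = np.arange(1, 10).reshape(3, 3)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(use_slices(A, B))  # same output as array_sum(A, B)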

How using "dot" (or "matmul") function for iterative multiplication in Python

I need to obtain a "W" matrix built from multiple matrix multiplications (each multiplication results in a column vector).
from numpy import matrix
from numpy import transpose
from numpy import matmul
from numpy import dot

# Iterative matrix multiplication
def iterativeMultiplication(X, Y):
    W = []  # list of matrix products
    X = matrix(X)  # same number of rows
    Y = matrix(Y)  # same number of rows
    h = 0
    while (h < X.shape[1]):
        W.append([])
        W[h] = dot(transpose(X), Y)  # using "dot" function
        h += 1
    return W
But, unexpectedly, I obtain a list of objects with their respective data types.
X = [[0., 0., 1.], [1.,0.,0.], [2.,2.,2.], [2.,5.,4.]]
Y = [[-0.2], [1.1], [5.9], [12.3]] # Edit Y column
iterativeMultiplication( X, Y )
Results in:
[array([[37.5], [73.3], [60.8]]),
 array([[37.5], [73.3], [60.8]]),
 array([[37.5], [73.3], [60.8]])]
I need some way to obtain only the numerical values for the matrix conversion.
W = matrix(W)  # Results in error
It is the same when using the "matmul" function. Thanks for your time.
If you want to stack multiple matrices, you can use numpy.vstack:
W = numpy.vstack(W)
Edit: There seems to be a discrepancy between your function, X and Y versus the "result" list in your question. But based on your comments below, what you're actually looking for is numpy.hstack (horizontal stack) which will give you the desired 3x3 matrix based on your "result" list.
W = numpy.hstack(W)
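For example, with three 3x1 column vectors like the ones in your result list, a minimal sketch:

import numpy as np

cols = [np.array([[37.5], [73.3], [60.8]])] * 3
W = np.hstack(cols)
print(W.shape)  # (3, 3)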
Of course you are going to get a list. You initialize W as a list, and append the same calculation to it 3 times.
But your 3-element arrays don't make sense with this data: array([[ 3.36877336], [ 3.97112615], [ 3.8092797 ]]).
If I make Xm=np.matrix(X), etc:
In [162]: Xm
Out[162]:
matrix([[ 0.,  0.,  1.],
        [ 1.,  0.,  0.],
        [ 2.,  2.,  2.],
        [ 2.,  5.,  4.]])

In [163]: Ym
Out[163]:
matrix([[  0.1,  -0.2],
        [  0.9,   1.1],
        [  6.2,   5.9],
        [ 11.9,  12.3]])

In [164]: Xm.T.dot(Ym)
Out[164]:
matrix([[ 37.1,  37.5],
        [ 71.9,  73.3],
        [ 60.1,  60.8]])

In [165]: Xm.T * Ym   # matrix interprets * as .dot
Out[165]:
matrix([[ 37.1,  37.5],
        [ 71.9,  73.3],
        [ 60.1,  60.8]])
You need to edit the question to have both valid Python code (it is missing def and :) and results that match the inputs.
===============
In [173]: Y = [[-0.2], [1.1], [5.9], [12.3]]
In [174]: Ym = np.matrix(Y)
Out[176]:
matrix([[ 37.5],
        [ 73.3],
        [ 60.8]])
=====================
This iteration is clumsy:
h = 0
while (h < X.shape[1]):
    W.append([])
    W[h] = dot(transpose(X), Y)  # using "dot" function
    h += 1
A more Pythonic approach:
for h in range(X.shape[1]):
    W.append(np.dot(...))
Or even:
W = [np.dot(....) for h in range(X.shape[1])]
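In fact, every pass through that loop computes the same product, so on plain ndarrays (np.matrix is best avoided in modern NumPy) the whole function collapses to a single multiplication. A sketch with the question's data:

import numpy as np

X = np.array([[0., 0., 1.], [1., 0., 0.], [2., 2., 2.], [2., 5., 4.]])
Y = np.array([[-0.2], [1.1], [5.9], [12.3]])

W = X.T @ Y   # equivalent to np.matmul(X.T, Y)
print(W)      # [[37.5] [73.3] [60.8]]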

Filtering histogram edges and counts

Consider a histogram calculation of a numpy array that returns percentages:
# 500 random numbers between 0 and 10,000
values = np.random.uniform(0, 10000, 500)

# Histogram using e.g. 200 buckets
perc, edges = np.histogram(values, bins=200,
                           weights=np.zeros_like(values) + 100/values.size)
The above returns two arrays:
perc, containing the percentages of values (out of the total) that fall between each pair of consecutive edges edges[ix] and edges[ix+1]
edges, of length len(perc) + 1
Now, say that I want to filter perc and edges so that I only end up with the percentages and edges for values contained within a new range [m, M].
That is, I want to work with the sub-arrays of perc and edges corresponding to the interval of values within [m, M]. Needless to say, the new array of percentages would still refer to the total fraction count of the input array. We just want to filter perc and edges to end up with the correct sub-arrays.
How can I post-process perc and edges to do so?
The values of m and M can be any number of course. In the example above, we can assume e.g. m = 0 and M = 200.
m = 0; M = 200
mask = (m < edges) & (edges < M)
>>> edges[mask]
array([ 37.4789683 , 87.07491593, 136.67086357, 186.2668112 ])
Let's work on a smaller dataset so that it is easier to understand:
np.random.seed(0)
values = np.random.uniform(0, 100, 10)
values.sort()
>>> values
array([ 38.34415188,  42.36547993,  43.75872113,  54.4883183 ,
        54.88135039,  60.27633761,  64.58941131,  71.51893664,
        89.17730008,  96.36627605])
# Histogram using e.g. 10 buckets
perc, edges = np.histogram(values, bins=10,
                           weights=np.zeros_like(values) + 100./values.size)
>>> perc
array([ 30., 0., 20., 10., 10., 10., 0., 0., 10., 10.])
>>> edges
array([ 38.34415188,  44.1463643 ,  49.94857672,  55.75078913,
        61.55300155,  67.35521397,  73.15742638,  78.9596388 ,
        84.76185122,  90.56406363,  96.36627605])
m = 0; M = 50
mask = (m <= edges) & (edges < M)
>>> mask
array([ True,  True,  True, False, False, False, False, False, False,
       False, False], dtype=bool)
>>> edges[mask]
array([ 38.34415188, 44.1463643 , 49.94857672])
>>> perc[mask[:-1]][:-1]
array([ 30., 0.])
m = 40; M = 60
mask = (m < edges) & (edges < M)
>>> edges[mask]
array([ 44.1463643 , 49.94857672, 55.75078913])
>>> perc[mask[:-1]][:-1]
array([ 0., 20.])
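A way to package this masking into a reusable helper (a sketch; it keeps only the bins that lie entirely inside [m, M], matching the results above):

import numpy as np

def filter_hist(perc, edges, m, M):
    # one flag per bin: keep bins whose left and right edges are both in [m, M]
    keep = (edges[:-1] >= m) & (edges[1:] <= M)
    # an edge survives if it borders a kept bin on either side
    edge_keep = np.r_[keep, False] | np.r_[False, keep]
    return perc[keep], edges[edge_keep]

With the 10-bin example above, filter_hist(perc, edges, 0, 50) returns (array([ 30., 0.]), array([ 38.34415188, 44.1463643 , 49.94857672])).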
Well, you might need some mathematics for this. The bins are equally spaced, so you can determine which bin is the first to include and which is the last by using the width of each bin:
bin_width = edges[1] - edges[0]
Now compute the first and last valid bin:
first = math.floor((m - edges[0]) / bin_width) + 1 # How many bins from the left
last = math.floor((edges[-1] - M) / bin_width) + 1 # How many bins from the right
(Ignore the +1 for both if you want to include the bin containing m or M - but then be careful that you don't end up with negative values for first and last!)
Now you know how many bins to include:
valid_edges = edges[first:-last]
valid_perc = perc[first:-last]
This will exclude first bins from the start and last bins from the end.
Might be that I haven't paid enough attention to rounding and there is an "off by one" error included, but I think the idea is sound. :-)
You probably need to catch special cases like M > edges[-1] but for readability I haven't included these.
Or if the bins are not equally spaced use boolean masks instead of the calculation:
first = edges[edges < m].size + 1
last = edges[edges > M].size + 1
