Mask array with boolean comparison to smaller-dimension array in Python

This is my first time asking a question, and I'm also really bad at coding. I have a 2D array, shape=(100, 4000), and want to manipulate all values in each row that are greater than the corresponding value in a 1D array, shape=(100,).
I want to do this in the most efficient way, as it will soon potentially be an array with shape=(1000, 8000). I'm not sure how to broadcast inside a boolean mask, or if I even can. Below is an example of my code that is obviously not working. I could do what I want by using a for loop or by duplicating the LX array to shape=(100, 4000), but I want it to be faster than those options.
import numpy as np

dX = np.random.rand(100, 4000)
LX = np.random.rand(100)
LX2 = LX / 2
dX[dX > LX2] = dX[dX > LX2] - LX[dX > LX2]  # fails: shapes (100, 4000) and (100,) cannot broadcast
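One way to make this work (my sketch, not from the original post) is to give the 1D arrays a trailing axis so NumPy broadcasts them row-wise against the 2D array:
import numpy as np

dX = np.random.rand(100, 4000)
LX = np.random.rand(100)
LX2 = LX / 2

# Reshape the 1D arrays to column vectors of shape (100, 1); they then
# broadcast against the (100, 4000) array, one threshold per row.
mask = dX > LX2[:, None]                   # boolean mask, shape (100, 4000)
dX = np.where(mask, dX - LX[:, None], dX)  # subtract LX only where masked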

Related

Is there a faster method to use a 2d numpy array of booleans to select elements from a 2d array, but with a 2d output?

If I have an array like this
arr = np.array([['a', 'b', 'c'],
                ['d', 'e', 'f']])
and an array of booleans of the same shape, like this:
boolarr = np.array([[False, True, False],
                    [True, True, True]])
I want to be able to select only the elements from the first array that correspond to a True in the boolean array. So the output would be:
out = [['b'],
       ['d', 'e', 'f']]
I managed to solve this with a simple for loop:
out = []
for n, i in enumerate(arr):
    out.append(i[boolarr[n]])
out = np.array(out, dtype=object)  # rows have different lengths, so recent NumPy needs dtype=object
but the problem is this solution is slow for large arrays, and I was wondering if there is an easier solution with numpy's indexing. Just using the normal notation arr[boolarr] returns a single flat array ['b', 'd', 'e', 'f']. I also tried using a slice with arr[:, [True, False, True]], which keeps the shape but can only use one boolean array.
Thanks for the comments. I misunderstood how an array works. For those curious, this is my solution (I'm actually working with numbers):
arr[boolarr] = np.nan
And then I just changed how the rest of the function handles NaN values.
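For the record, the same NaN trick can be written as a single vectorized expression that keeps the 2D shape (my sketch, using made-up numeric data since the original arrays held strings):
import numpy as np

arr = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])
boolarr = np.array([[False, True, False],
                    [True, True, True]])

# NaN wherever the mask is True, the original value elsewhere, all in one
# vectorized step (no Python-level loop over rows).
out = np.where(boolarr, np.nan, arr)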

How to operate on each cell of a 2D array/matrix?

Hi, I'm reading two rasters A & B as arrays.
What I'm looking for is to perform an operation on certain cells within two 2D arrays (two rasters). I need to subtract 3.0 from the cells in one array (B) that are greater than the corresponding cells in the other 2D array (A).
All the other cells don't need to change, so my answer will be the 2D array (B) with some changed cells that fit that condition, and the other 2D array (A) untouched.
I tried this, but it doesn't seem to work (and also takes TOO long):
A = Raster_A.GetRasterBand(1).ReadAsArray()
B = Raster_B.GetRasterBand(1).ReadAsArray()

A = np.array([917.985028, 916.284480, 918.525323, 920.709505,
              921.835315, 922.328555, 920.283029, 922.229594,
              922.928670, 925.315534, 922.280360, 922.715303,
              925.933969, 925.897328, 923.880606, 923.864701])

B = np.array([913.75785758, 914.45941854, 915.17586919, 915.90724705,
              916.6534542 , 917.4143068 , 918.18957846, 918.97902532,
              919.78239295, 920.59941086, 921.42978108, 922.27316565,
              923.12917544, 923.99736194, 924.87721232, 925.76814782])

for i in np.nditer(A, op_flags=['readwrite']):
    for j in np.nditer(B, op_flags=['readwrite']):
        if j[...] > i[...]:
            B = j[...] - 3.0
So the answer, the array B, should be something like:
B = np.array([913.75785758, 914.45941854, 915.17586919, 915.90724705,
              916.6534542 , 917.4143068 , 918.18957846, 918.97902532,
              919.78239295, 920.59941086, 921.42978108, 922.27316565,
              923.12917544, 923.99736194, 921.87721232, 922.76814782])
Please notice the two bottom right values :)
I'm a bit dizzy already from trying things and doing other stuff at the same time, so I apologize if I did anything stupid right there; any suggestion is greatly appreciated. Thanks!
Based on your example, I conclude that you want to subtract values from the array B. This can be done via
B[A < B] -= 3
The "mask" A < B is a boolean array that is True at exactly the values you want to change, and the indexed in-place assignment B[A < B] -= 3 updates all of these values at once.
It is crucial to use the in-place operator -= here: if you instead wrote B = B[A < B] - 3, a new array would be created that contains only the selected values. That array is flattened, i.e. it loses the original shape, and you do not want that.
Regarding speed, avoid for loops as much as you can when working with numpy. Fancy indexing and slicing offer very neat (and super fast) ways to work with your data. Maybe have a look here and here.
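A quick check with the example arrays from the question (only the two bottom-right values of B satisfy B > A, so only they change):
import numpy as np

A = np.array([917.985028, 916.284480, 918.525323, 920.709505,
              921.835315, 922.328555, 920.283029, 922.229594,
              922.928670, 925.315534, 922.280360, 922.715303,
              925.933969, 925.897328, 923.880606, 923.864701])
B = np.array([913.75785758, 914.45941854, 915.17586919, 915.90724705,
              916.6534542 , 917.4143068 , 918.18957846, 918.97902532,
              919.78239295, 920.59941086, 921.42978108, 922.27316565,
              923.12917544, 923.99736194, 924.87721232, 925.76814782])

B[A < B] -= 3.0
print(B[-2:])  # [921.87721232 922.76814782], matching the expected output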

Use scipy.sparse.dok_matrix.setdiag on only a part of a dok_matrix object

Let's suppose I have a (very large) sparse matrix, a scipy.sparse.dok_matrix object. I want to set the diagonal of only a submatrix to a certain value. I first thought something like this would work:
import scipy.sparse as sp
num = 20  # num can go up to large numbers
A = sp.dok_matrix((num, num))
A[num//2:-1, num//2:-1].setdiag(2)
, but this only leads to an empty matrix (because of the way the matrix is stored internally using arrays, I suppose?). I know that for this small example I could use setdiag on the whole matrix and plug in an array with zeros at the beginning, but this won't be sufficient for larger matrix dimensions, as the array would get too big.
I also tried:
A[num//2:-1,num//2:-1] = 2*sp.eye((num-1)//2)
This does what I want it to do, but much too slowly. Is there a way to get the same result faster (i.e. without setting all the entries of the submatrix explicitly)?
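Since a dok_matrix is backed by a dictionary, one possible alternative (my sketch, not from the original post) is to write just the diagonal entries of the submatrix directly; each assignment is a single dictionary insertion, so only the entries that actually change are touched:
import scipy.sparse as sp

num = 20  # works the same way for large num
A = sp.dok_matrix((num, num))

# Set the diagonal of the submatrix A[num//2:-1, num//2:-1] entry by entry.
for i in range(num // 2, num - 1):
    A[i, i] = 2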

Increasing performance of highly repeated numpy array index operations

In my program code I've got numpy value arrays and numpy index arrays. Both are preallocated and predefined during program initialization.
Each part of the program has one array, values, on which calculations are performed, and three index arrays: idx_from_exch, idx_values and idx_to_exch. There is one global value array used to exchange the values of the several parts: exch_arr.
The index arrays have between 2 and 5 indices most of the time; seldom (most probably never) will more indices be needed. dtype=np.int32; shape and values are constant during the whole program run. Thus I set ndarray.flags.writeable=False after initialization, but this is optional. The index values of the index arrays idx_values and idx_to_exch are sorted in numerical order; idx_from_exch may be sorted, but there is no way to guarantee that. All index arrays corresponding to one value array/part have the same shape.
The values arrays, and also exch_arr, usually have between 50 and 1000 elements. shape and dtype=np.float64 are constant during the whole program run; the values of the arrays change in each iteration.
Here are the example arrays:
import numpy as np
import numba as nb
values = np.random.rand(100) * 100 # just some random numbers
exch_arr = np.random.rand(60) * 3 # just some random numbers
idx_values = np.array((0, 4, 55, -1), dtype=np.int32) # sorted but varying steps
idx_to_exch = np.array((7, 8, 9, 10), dtype=np.int32) # sorted and constant steps!
idx_from_exch = np.array((19, 4, 7, 43), dtype=np.int32) # not sorted and varying steps
The example indexing operations look like this:
values[idx_values] = exch_arr[idx_from_exch] # get values from exchange array
values *= 1.1 # some inplace array operations, this is just a dummy for more complex things
exch_arr[idx_to_exch] = values[idx_values] # pass some values back to exchange array
Since these operations are applied once per iteration for several million iterations, speed is crucial. I've been looking into many different ways of increasing indexing speed in my previous question, but I forgot to be specific enough about my application (especially getting values by indexing with constant index arrays and passing them to another indexed array).
The best way to do it seems to be fancy indexing so far. I'm currently also experimenting with numba guvectorize, but it seems that it is not worth the effort since my arrays are quite small.
memoryviews would be nice, but since the index arrays do not necessarily have consistent steps, I know of no way to use memoryviews.
So is there any faster way to do repeated indexing? Some way of predefining memory address arrays for each indexing operation, as dtype and shape are always constant? ndarray.__array_interface__ gave me a memory address, but I wasn't able to use it for indexing. I thought about something like:
stride_exch = exch_arr.strides[0]
mem_address = exch_arr.__array_interface__['data'][0]
idx_to_exch = idx_to_exch * stride_exch + mem_address
Is that feasible?
I've also been looking into using strides directly with as_strided, but as far as I know only consistent strides are allowed and my problem would require inconsistent strides.
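For what it's worth, one micro-optimization that might be worth benchmarking here (my suggestion, not from the original post) is replacing the fancy-indexing round trips with np.take and np.put, which bypass some of the general-purpose indexing machinery and can be slightly faster for small arrays:
import numpy as np

values = np.random.rand(100) * 100
exch_arr = np.random.rand(60) * 3
idx_values = np.array((0, 4, 55, -1), dtype=np.int32)
idx_to_exch = np.array((7, 8, 9, 10), dtype=np.int32)
idx_from_exch = np.array((19, 4, 7, 43), dtype=np.int32)

tmp = np.empty(len(idx_values))  # preallocated once, reused every iteration

np.take(exch_arr, idx_from_exch, out=tmp)          # gather from exch_arr
values[idx_values] = tmp
values *= 1.1                                      # dummy in-place work
np.put(exch_arr, idx_to_exch, values[idx_values])  # scatter back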
Any help is appreciated!
Thanks in advance!
edit:
I just corrected a massive error in my example calculation!
The operation values = values * 1.1 changes the memory address of the array. All my operations in the program code are laid out so as not to change the memory addresses of the arrays, because a lot of other operations rely on memoryviews. Thus I replaced the dummy operation with the correct in-place operation: values *= 1.1
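A tiny illustration of that difference (my example): the rebinding form allocates a new array, while the augmented assignment writes into the existing buffer:
import numpy as np

values = np.random.rand(100)
view = values[:10]     # stands in for a memoryview other code relies on

values *= 1.1          # in place: `view` still sees the updated data
values = values * 1.1  # rebinds `values` to a NEW array; `view` now refers
                       # to the old, stale buffer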
One way to get around expensive fancy indexing with numpy boolean arrays is to use numba and skip over the False values in the boolean array.
Example implementation:
from numba import guvectorize

@guvectorize(['void(float64[:], float64[:, :], boolean[:], float64[:, :])'],
             '(n),(m,n),(n)->(m,n)', nopython=True, target="cpu")
def test_func(arr1, arr2, inds, res):
    res[:, :] = 0.0                # guvectorize outputs start uninitialized
    for i in range(arr1.shape[0]):
        if not inds[i]:            # skip columns where the mask is False
            continue
        for j in range(arr2.shape[0]):
            res[j, i] = arr1[i] + arr2[j, i]
Of course, play around with the numpy data types (smaller byte sizes will run faster) and target being "cpu" or "parallel".
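A hypothetical call of the sketch above, with shapes matching the (n),(m,n),(n)->(m,n) layout:
import numpy as np

arr1 = np.random.rand(1000)
arr2 = np.random.rand(5, 1000)
inds = np.random.rand(1000) > 0.5  # boolean mask; False columns are skipped

res = test_func(arr1, arr2, inds)  # res has shape (5, 1000)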

How to keep track of original row indices in Numpy array when comparing to only a slice?

I'm working with a 2D numpy array A, performing a comparison of a one-dimensional array, X, against each row in A. As approximate matches are found, I keep track of their indices in A in a dtype=bool array S. I'd like to use S to shrink the field of match candidates in A to improve efficiency. Here's the basic idea in code:
def compare(nxt):
    S[nxt] = 0  # sets the boolean flag for this row
    T = A[nxt, i:] == A[S, :-i]  # T has different dimensions than A
compare() is iterated over and S is progressively populated with False values.
The problem is that the boolean array T has the dimensions of the pared-down version of A, not the original version. I'm hoping to use T to get the indices (in the unsliced A) of the approximate matches for later use.
np.argwhere(T)
This returns a list of indices of the matches, but again relative to the slice of A.
It seems like there has to be a better way to crop A for more efficient searching and, at the same time, still get the correct indices of the matching rows.
Any thoughts?
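One common workaround (my sketch, not from the original post) is to carry an explicit array of original row numbers alongside the mask, so hits found in the pared-down slice can be mapped back to rows of the unsliced A:
import numpy as np

A = np.random.randint(0, 5, size=(10, 6))
S = np.ones(len(A), dtype=bool)  # candidate mask over rows of A
S[[2, 5]] = False                # pretend some rows were already ruled out
orig_rows = np.arange(len(A))    # original row numbers, 0..9

candidates = A[S]                      # pared-down copy of A
hits = np.argwhere(candidates == 3)    # (row, col) indices within the slice
hits[:, 0] = orig_rows[S][hits[:, 0]]  # map slice rows back to rows of A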
