I am using numpy to do some calculations. In the following code:
assert len(A.shape) == 2  # A is a 2D ndarray
d1, d2 = A.shape
# initialize G with the same dimensions as A, filled with zeros,
# and assign the last column of A to the last column of G
G = np.zeros_like(A)
G[:, d2-1] = A[:, d2-1]
# columns 0 .. d2-2 of G are averages over columns 0 .. d2-2 of A,
# based on the condition on B
for iW in range(d2-1):
    n = 0
    sum = 0.0
    for i in range(d1):
        if B[i, 0] != iW and B[i, 1] == 0:
            sum += A[i, iW]
            n += 1
    for i in range(d1):
        if B[i, 0] != iW and B[i, 1] == 0:
            G[i, iW] = sum / (1.0 * n)
return G
Is there an easier way using "slicing" or "boolean array"?
Thanks!
In case you want G to have the same dimensionality as A and then change the appropriate elements of G, the following code should work:
# create G as a copy of A, otherwise you might change A by changing G
G = A.copy()
# build the mask for all columns except the last one
m = (B[:,0][:,None] != np.arange(d2-1)[None,:]) & (B[:,1]==0)[:,None]
# matrix with those elements of A which fulfill the conditions, 0 elsewhere
C = np.where(m, A[:,:d2-1], 0).astype(float)
# get the 'modified' average you use (a boolean mask sums as 0s and 1s)
avg = np.sum(C, axis=0) / np.sum(m, axis=0)
# change the appropriate elements in all the columns except the last one
G[:,:-1] = np.where(m, avg, A[:,:d2-1])
After fiddling for a long time and fixing bugs ... I ended up with this code. I checked it against several random matrices A and specific choices of B,
A = np.random.randint(100, size=(5,10))
B = np.column_stack(([4,2,1,3,4], np.zeros(5)))
and so far your result and mine were in agreement.
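For anyone who wants to double-check the equivalence themselves, here is a minimal sketch of such a comparison (the wrapper function names are mine; it assumes, as this answer does, that entries not matching the condition keep their values from A, and that every column selects at least one row):

import numpy as np

def loop_version(A, B):
    # the question's double loop, wrapped in a function for testing
    d1, d2 = A.shape
    G = A.astype(float).copy()
    for iW in range(d2 - 1):
        sel = [i for i in range(d1) if B[i, 0] != iW and B[i, 1] == 0]
        avg = sum(A[i, iW] for i in sel) / len(sel)
        for i in sel:
            G[i, iW] = avg
    return G

def vectorized_version(A, B):
    d2 = A.shape[1]
    G = A.astype(float).copy()
    m = (B[:, 0][:, None] != np.arange(d2 - 1)) & (B[:, 1] == 0)[:, None]
    avg = np.where(m, A[:, :d2 - 1], 0).sum(axis=0) / m.sum(axis=0)
    G[:, :-1] = np.where(m, avg, A[:, :d2 - 1])
    return G

A = np.random.randint(100, size=(5, 10))
B = np.column_stack(([4, 2, 1, 3, 4], np.zeros(5)))
assert np.allclose(loop_version(A, B), vectorized_version(A, B))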
Here's a start, focusing on the first inner loop:
In [34]: iW=0
In [35]: A=np.arange(12).reshape(3,4)
In [36]: B=np.array([[0,0],[1,0],[2,0]])
In [37]: sum=0
In [38]: for i in range(3):
   ....:     if B[i,0]!=iW and B[i,1]==0:
   ....:         sum += A[i,iW]
   ....:         print(i,A[i,iW])
   ....:
1 4
2 8
In [39]: A[(B[:,0]!=iW)&(B[:,1]==0),iW].sum()
Out[39]: 12
I had to provide my own sample data to test this.
The 2nd loop has the same condition (B[:,0]!=iW)&(B[:,1]==0), and should work in the same way.
As one of the comments said, the dimensions of G look funny. To make things work with my sample, let's make G a zeros array. It looks like you are assigning the mean of a subset of A (sum/n) to selected elements of G:
In [52]: G=np.zeros_like(A)
In [53]: I=(B[:,0]!=iW)&(B[:,1]==0)
In [54]: G[I,iW]=A[I,iW].mean()
Assuming n, the number of terms summed for each iW, varies, it may be difficult to compress the outer loop into a vectorized step. If n were the same, you could pull out the subset of A that matches the condition, e.g. A1, take the mean along one axis, and assign the values to G. With different numbers of terms in the sums, you still have to loop.
It just occurred to me that masked arrays might work. Mask off the terms of A that don't meet the condition, and then take the mean.
In [91]: I=(B[:,[0]]!=np.arange(4))&(B[:,[1]]==0)
In [92]: I
Out[92]:
array([[False, True, True, True],
[ True, False, True, True],
[ True, True, False, True]], dtype=bool)
In [93]: A1=np.ma.masked_array(A, ~I)
In [94]: A1
Out[94]:
masked_array(data =
[[-- 1 2 3]
[4 -- 6 7]
[8 9 -- 11]],
mask =
[[ True False False False]
[False True False False]
[False False True False]],
fill_value = 999999)
In [95]: A1.mean(0)
Out[95]:
masked_array(data = [6.0 5.0 4.0 7.0],
mask = [False False False False],
fill_value = 1e+20)
Or with plonser's where:
In [111]: np.where(I,A,0).sum(0)/I.sum(0)
Out[111]: array([ 6., 5., 4., 7.])
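Putting the pieces together for the original problem, here is a sketch (the zero initialization mirrors the question; it assumes every column selects at least one row) that fills all selected elements of G in one shot using the same 2-D mask I:

import numpy as np

A = np.arange(12).reshape(3, 4)
B = np.array([[0, 0], [1, 0], [2, 0]])
d2 = A.shape[1]

# 2-D mask: entry (i, iW) is True when row i contributes to column iW
I = (B[:, [0]] != np.arange(d2 - 1)) & (B[:, [1]] == 0)

# per-column means over the selected entries only
avg = np.where(I, A[:, :d2 - 1], 0).sum(axis=0) / I.sum(axis=0)

# zero-initialized G as in the question; copy the last column,
# then broadcast each column's mean into the selected positions
G = np.zeros_like(A, dtype=float)
G[:, -1] = A[:, -1]
G[:, :-1] = np.where(I, avg, 0)
print(G)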
Let's say that I have a numpy array a = [1 2 3 4 5 6 7 8] and I want to change everything except 1, 2 and 3 to 0. With a list b = [1,2,3] I tried a[a not in b] = 0, but Python does not accept this. Currently I'm using a for loop like this:
c = np.unique(a)
for i in c:
    if i not in b:
        a[a == i] = 0
This works very slowly (around 900 different values in a 3D array of roughly 1000x1000x1000) and doesn't feel like the optimal solution for numpy. Is there a more optimal way of doing it in numpy?
You can use numpy.isin() to create a boolean mask to use as an index:
np.isin(a, b)
# array([ True, True, True, False, False, False, False, False])
Use ~ to do the opposite:
~np.isin(a, b)
# array([False, False, False, True, True, True, True, True])
Using this to index the original array lets you assign zero to the specific elements:
a = np.array([1,2,3,4,5,6,7,8])
b = np.array([1, 2, 3])
a[~np.isin(a, b)] = 0
print(a)
# [1 2 3 0 0 0 0 0]
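As a side note, np.isin also takes an invert=True argument, which computes the negated mask directly instead of materializing it and flipping it with ~; at the sizes mentioned in the question that saves one large temporary boolean array. A minimal sketch:

import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8])
b = np.array([1, 2, 3])

# invert=True means "not in b", so no separate ~ step is needed
a[np.isin(a, b, invert=True)] = 0
print(a)  # [1 2 3 0 0 0 0 0]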
I am working with time-series data. Let's say I have two lists of equal shape and I need to find instances where both lists have numbers greater than zero at the same position.
To break it down
A = [1,0,2,0,4,6,0,5]
B = [0,0,5,6,7,5,0,2]
We can see that in four positions, both lists have numbers greater than 0. There are other instances, but I am sure that if I can get simple code for this, all it needs is adjusting the signs, and I can also use it at a larger scale.
I have tried
len([1 for i in A if i > 0 and 1 for i in B if i > 0 ])
But I think the answer it gives me is the product of the two counts instead.
Since you have a numpy tag:
A = np.array([1,0,2,0,4,6,0,5])
B = np.array([0,0,5,6,7,5,0,2])
mask = ((A>0)&(B>0))
# array([False, False, True, False, True, True, False, True])
mask.sum()
# 4
A[mask]
# array([2, 4, 6, 5])
B[mask]
# array([5, 7, 5, 2])
In pure python (can be generalized to any number of lists):
A = [1,0,2,0,4,6,0,5]
B = [0,0,5,6,7,5,0,2]
mask = [all(e>0 for e in x) for x in zip(A, B)]
# [False, False, True, False, True, True, False, True]
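Counting from that mask is then just a sum, since each True counts as 1; for the sample data this gives the expected 4:

print(sum(mask))  # 4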
If you want to use vanilla python, this should be doing what you are looking for
l = 0
for i in range(len(A)):
    if A[i] > 0 and B[i] > 0:
        l = l + 1
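The same count can also be written as a single generator expression, a purely stylistic alternative to the loop above:

l = sum(1 for x, y in zip(A, B) if x > 0 and y > 0)
print(l)  # 4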
I'm having trouble creating a mask without using a for loop.
I've got a numpy array of size N with my labels and I want to create a mask of size NxN where mask[i, j] = True if and only if y[i] == y[j].
I've managed to do so by using a for loop:
mask = np.asarray([np.where(y==y[k], 1, 0) for k in range(len(y))])
But I'm working on a GPU and this greatly increases the compute time. How can I do it without looping?
This might get you started:
n = 3
a = np.arange(n)
np.equal.outer(a, a)
# this is the same as
a[:,None] == a
Output:
array([[ True, False, False],
[False, True, False],
[False, False, True]])
This is basically comparing the elements from a cartesian product: a[0] == a[0], a[0] == a[1], a[0] == a[2], and so forth, which is why the diagonal values are True when using np.arange.
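Applied to actual labels rather than np.arange, this yields exactly the mask described in the question; a small sketch with made-up labels:

import numpy as np

y = np.array([0, 1, 0, 2])

# broadcasting compares every label against every other label
mask = y[:, None] == y
print(mask)
# [[ True False  True False]
#  [False  True False False]
#  [ True False  True False]
#  [False False False  True]]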
You can use np.repeat and .T
a and b are just arbitrary data - the labels in your case.
import numpy as np
size = 4
a = np.arange(size)[:, None]
b = a.T  # b is a view of a, so changing b also changes a
b[0, 2] = 1  # introduce a duplicate label
c = np.repeat(a.T, repeats=size, axis=0)
d = np.repeat(b, repeats=size, axis=0).T
print(c)
print(d)
e = np.equal(c, d)
print(e)
out:
[[0 1 1 3]
[0 1 1 3]
[0 1 1 3]
[0 1 1 3]]
[[0 0 0 0]
[1 1 1 1]
[1 1 1 1]
[3 3 3 3]]
[[ True False False False]
[False True True False]
[False True True False]
[False False False True]]
For problems like these, np.indices is your friend:
dims = (len(y), len(y))
inds = np.indices(dims)
mask = y[inds[0]] == y[inds[1]]
edit:
Kevin's more specific solution is more concise and almost certainly faster than this method.
I have a boolean mask shaped (M, N). Each column in the mask may have a different number of True elements, but is guaranteed to have at least two. I want to find the row index of the last two such elements as efficiently as possible.
If I only wanted one element, I could do something like (M - 1) - np.argmax(mask[::-1, :], axis=0). However, that won't help me get the second-to-last index.
I've come up with an iterative solution using np.where or np.nonzero:
M = 4
N = 3
mask = np.array([
[False, True, True],
[True, False, True],
[True, False, True],
[False, True, False]
])
result = np.zeros((2, N), dtype=np.intp)
for col in range(N):
    result[:, col] = np.flatnonzero(mask[:, col])[-2:]
This creates the expected result:
array([[1, 0, 1],
[2, 3, 2]], dtype=int64)
I would like to avoid the final loop. Is there a reasonably vectorized form of the above? I am looking for specifically two rows, which are always guaranteed to exist. A general solution for arbitrary element counts is not required.
An argsort does it -
In [9]: np.argsort(mask,axis=0,kind='stable')[-2:]
Out[9]:
array([[1, 0, 1],
[2, 3, 2]])
Another with cumsum -
# cumulative count of True values down each column
c = mask.cumsum(0)
# keep the Trues whose running count is within the last two per column,
# then read off their row indices
out = np.where((mask & (c>=c[-1]-1)).T)[1].reshape(-1,2).T
Specifically for exactly two rows, one way with argmax -
c = mask.copy()
# row index of the last True in each column
idx = len(c)-c[::-1].argmax(0)-1
# zero out that last True, then repeat to find the second-to-last
c[idx,np.arange(len(idx))] = 0
idx2 = len(c)-c[::-1].argmax(0)-1
out = np.vstack((idx2,idx))
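A quick consistency check on the sample mask from the question, confirming that the argsort and cumsum approaches both produce the expected result:

import numpy as np

mask = np.array([
    [False, True, True],
    [True, False, True],
    [True, False, True],
    [False, True, False],
])
expected = np.array([[1, 0, 1], [2, 3, 2]])

# argsort approach: a stable sort pushes the Trues to the end, in order
assert np.array_equal(np.argsort(mask, axis=0, kind='stable')[-2:], expected)

# cumsum approach: keep Trues whose running count is within the last two
c = mask.cumsum(0)
out = np.where((mask & (c >= c[-1] - 1)).T)[1].reshape(-1, 2).T
assert np.array_equal(out, expected)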
I have 5 grayscale images in the form of 288x288 ndarrays. The values in each ndarray are just numpy.float32 numbers ranging from 0.0 to 255.0. For each ndarray, I've created a numpy.ma.MaskedArray object as follows:
def bool_row(row):
    return [value == 183. for value in row]

mask = [bool_row(row) for row in nd_array_1]
masked_array_1 = ma.masked_array(nd_array_1, mask=mask)
The value 183. represents "garbage" in the image. All 5 images have a bit of "garbage" in them. I want to take the median of the masked images, where taking the median for each point should ignore any masked values. The result would be the correct image with no garbage.
When I try:
ma.median([masked_array_1, masked_array_2, masked_array_3, masked_array_4, masked_array_5], axis=0)
I get what seems to be the median except instead of ignoring masked values, it treats them as 183., so the result just has the superimposed garbage from all the pictures. When I just take the median of two masked images:
ma.median([masked_array_1, masked_array_2], axis=0)
It looks like it started to do the right thing, but then placed the value of 183. even where both masked arrays contain a MaskedConstant.
I could do something like the following, but I feel there's probably a way to make ma.median just behave as expected:
unmasked_array_12 = ma.median([masked_array_1, masked_array_2], axis=0)
mask = [bool_row(row) for row in unmasked_array_12]
masked_array_12 = ma.masked_array(unmasked_array_12, mask=mask)
unmasked_array_123 = ma.median([masked_array_12, masked_array_3], axis=0)
mask = [bool_row(row) for row in unmasked_array_123]
masked_array_123 = ma.masked_array(unmasked_array_123, mask=mask)
...
How do I make ma.median work as expected without resorting to the above unpleasantness?
I suspect the problem is in how ma.median handles a non-array argument. It might be converting a list to a plain numpy array, without checking the types of the elements of the list.
Consider the following example with 1-D arrays:
In [64]: a = ma.array([1, 2, -10, 3, -10, -10], mask=[0,0,1,0,1,1])
In [65]: b = ma.array([1, 2, -10, -10, 4, -10], mask=[0,0,1,1,0,1])
In [66]: a
Out[66]:
masked_array(data = [1 2 -- 3 -- --],
mask = [False False True False True True],
fill_value = 999999)
In [67]: b
Out[67]:
masked_array(data = [1 2 -- -- 4 --],
mask = [False False True True False True],
fill_value = 999999)
The following are not correct--it appears to ignore the masks:
In [68]: ma.median([a, b])
Out[68]: -4.5
In [69]: ma.median([a, b], axis=0)
Out[69]:
masked_array(data = [ 1. 2. -10. -3.5 -3. -10. ],
mask = False,
fill_value = 1e+20)
However, if I first create a new masked array using ma.array, ma.median handles it correctly:
In [70]: c = ma.array([a, b])
In [71]: c
Out[71]:
masked_array(data =
[[1 2 -- 3 -- --]
[1 2 -- -- 4 --]],
mask =
[[False False True False True True]
[False False True True False True]],
fill_value = 999999)
In [72]: ma.median(c)
Out[72]: 2.0
In [73]: ma.median(c, axis=0)
Out[73]:
masked_array(data = [1.0 2.0 -- 3.0 4.0 --],
mask = [False False True False False True],
fill_value = 1e+20)
So to fix your problem, it might be as simple as replacing this:
ma.median([masked_array_1, masked_array_2, masked_array_3, masked_array_4, masked_array_5], axis=0)
with this:
stacked = ma.array([masked_array_1, masked_array_2, masked_array_3, masked_array_4, masked_array_5])
ma.median(stacked, axis=0)
You can use the following to get rid of all of the 183 values just while calculating the median:
masked_arrays = [masked_array_1, masked_array_2, masked_array_3]
no_junk_arrays = [[x for x in masked_array if x != 183] for masked_array in masked_arrays]
ma.median(no_junk_arrays)
For example
>>> masked_array_1 = [1,183,4]
>>> masked_array_2 = [1,183,2]
>>> masked_array_3 = [2,183,2]
>>> masked_arrays=[masked_array_1,masked_array_2,masked_array_3]
>>> no_junk_arrays=[[x for x in masked_array if x != 183] for masked_array in masked_arrays]
>>> no_junk_arrays
[[1, 4], [1, 2], [2, 2]]
I'm sure it can be done if you find the clever sequence of numpy functions to invoke. But it can also be done naively:
def merge(a1, a2):
    result = []
    for x, y in zip(a1, a2):
        if x == 183:
            x = y
        result.append(x)
    return result

array_1 = [1, 183, 2]
array_2 = [1, 183, 183]
array_3 = [183, 4, 2]
print(merge(merge(array_1, array_2), array_3))
If this runs too slowly, you can try the same code on PyPy instead of CPython.
If what you are after is fetching the first non-183 value for every pixel, you could do something along the lines of:
stacked_imgs = np.dstack((img1, img2, img3))
mask = stacked_imgs == 183
# find the first False, i.e. non-183 entry, along the stack axis
index = np.argmin(mask, axis=-1)
# pick that entry for every pixel
correct_image = np.take_along_axis(stacked_imgs, index[..., None], axis=-1)[..., 0]
If all non-183 entries for a given pixel are always the same, this will give you the result you are after.
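Tying the earlier fix back to the original setup, here is a minimal end-to-end sketch (the toy 2x2 images are mine, standing in for the 288x288 ones) that builds each mask with ma.masked_equal instead of the nested list comprehensions, stacks the images into one MaskedArray, and then takes the median:

import numpy as np
import numpy.ma as ma

# toy stand-ins for the grayscale images
img1 = np.array([[10., 183.], [30., 40.]], dtype=np.float32)
img2 = np.array([[183., 20.], [30., 183.]], dtype=np.float32)
img3 = np.array([[12., 22.], [183., 42.]], dtype=np.float32)

# masked_equal builds the "value == 183" mask in one call per image
masked = [ma.masked_equal(img, 183.) for img in (img1, img2, img3)]

# stack into a single MaskedArray so ma.median respects the masks
stacked = ma.array(masked)
print(ma.median(stacked, axis=0))
# pixel-wise medians: [[11, 21], [30, 41]]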